Abstract

We introduce a new technique for constructing a deterministic finite
state automaton from a regular expression, based on the idea of
marking a suitable set of positions inside the expression, intuitively
representing the possible points reached after processing an initial
prefix of the input string. Pointed regular expressions join the
elegance and the symbolic appeal of Brzozowski's derivatives with the
effectiveness of McNaughton and Yamada's labelling technique,
essentially combining the best of the two approaches.

Categories and Subject Descriptors F.1.1 [Models of Computation]

General Terms Theory 

Keywords Regular expressions, Finite State Automata, Derivatives

1. Introduction 

There is hardly a subject in Theoretical Computer Science that, in
view of its relevance and elegance, has been so thoroughly
investigated as the notion of regular expression and its relation with
finite state automata (see e.g. [1, 2] for some recent surveys). All
the studies in this area have traditionally been inspired by two
pioneering, foundational works: Brzozowski's theory of derivatives
[3], and McNaughton and Yamada's algorithm [4]. The main advantages of
derivatives are that they are syntactically appealing, easy to grasp
and easy to prove correct (see [5] for a recent revisitation). On the
other hand, McNaughton and Yamada's approach results in a particularly
efficient algorithm, still used by most pattern matchers like the
popular grep and egrep utilities. The relation between the two
approaches has been deeply investigated too, starting from the seminal
work by Berry and Sethi [6], where it is shown how to refine
Brzozowski's method to get to the efficient algorithm (Berry and
Sethi's algorithm has been further improved by later authors [7, 8]).

Regular expressions are such a small world that it comes as no
surprise that all the different approaches turn out, in the end, to be
equivalent; still, their philosophy, their underlying intuition, and
the techniques to be deployed can be appreciably different. Without
any pretension of saying anything really original on the subject, we
introduce in this paper the notion of pointed regular expression,
which provides a cheap palliative for derivatives and allows a simple,
direct and efficient construction of the deterministic finite
automaton. Remarkably, the formal correspondence between pointed
expressions and Brzozowski's derivatives is unexpectedly entangled
(see Section 4.1), testifying to the novelty and the not-so-trivial
nature of the notion.

The idea of pointed expressions was suggested by an attempt to
formalize the theory of regular languages by means of an interactive
prover.¹ At first, we started considering derivatives, since they
looked more suitable to the kind of symbolic manipulations that can be
easily dealt with by means of these tools. However, the need to
consider sets of derivatives and, especially, to reason modulo
associativity, commutativity and idempotence of sum, prompted us to
look for an alternative notion. Now, it is clear that, in some sense,
the derivative of a regular expression e is a set of "subexpressions"
of e;² the only, crucial, difference is that we cannot forget their
context. So, the natural solution is to point at subexpressions inside
the original term. This immediately leads to the notion of pointed
regular expression (pre), which is just a normal regular expression
where some positions (it is enough to consider individual characters)
have been pointed out. Intuitively, the points mark the positions
inside the regular expression which have been reached after reading
some prefix of the input string, or better the positions where the
processing of the remaining string has to be started. Each pointed
expression for e represents a state of the deterministic automaton
associated with e; since we obviously have only a finite number of
possible labellings, the number of states of the automaton is finite.

Pointed regular expressions allow the direct construction of the
DFA [9] associated with a regular expression, in a way that is
simple, intuitive, and efficient (the task is traditionally considered
very involved in the literature: see e.g. [11], p. 71).

In the imposing bibliography on regular expressions, as far as we
could discover, the only author mentioning a notion close to ours is
Watson [10, 11]. However, he only deals with single points, while the
most interesting properties of pres derive from their implicit
additive nature (such as the possibility of computing the move
operation by a single pass on the marked expression: see
Definition 21).






1 The rule of the game was to avoid overkilling, i.e. not make it more 
complex than deserved. 

2 This is also the reason why, at the end, we only have a finite number of 
derivatives. 
2. Regular expressions

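DEFINITION 1. A regular expression over the alphabet $\Sigma$ is an
expression $e$ generated by the following grammar:

\[ E ::= \emptyset \mid \epsilon \mid a \mid E + E \mid EE \mid E^* \]

with $a \in \Sigma$.

DEFINITION 2. The language $L(e)$ associated with the regular
expression $e$ is defined by the following rules:

\[
\begin{aligned}
L(\emptyset) &= \emptyset \\
L(\epsilon) &= \{\epsilon\} \\
L(a) &= \{a\} \\
L(e_1 + e_2) &= L(e_1) \cup L(e_2) \\
L(e_1 e_2) &= L(e_1) \cdot L(e_2) \\
L(e^*) &= L(e)^*
\end{aligned}
\]

where $\epsilon$ is the empty string,
$L_1 \cdot L_2 = \{ l_1 l_2 \mid l_1 \in L_1, l_2 \in L_2 \}$ is the
concatenation of $L_1$ and $L_2$, and $L^*$ is the so-called Kleene
closure of $L$: $L^* = \bigcup_{i=0}^{\infty} L^i$, with
$L^0 = \{\epsilon\}$ and $L^{i+1} = L \cdot L^i$.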

Definition 3 (nullable).

A regular expression $e$ is said to be nullable if $\epsilon \in L(e)$.

The fact of being nullable is decidable; it is easy to prove that
the characteristic function $\nu(e)$ can be computed by the following
rules:

\[
\begin{aligned}
\nu(\emptyset) &= \mathit{false} \\
\nu(\epsilon) &= \mathit{true} \\
\nu(a) &= \mathit{false} \\
\nu(e_1 + e_2) &= \nu(e_1) \vee \nu(e_2) \\
\nu(e_1 e_2) &= \nu(e_1) \wedge \nu(e_2) \\
\nu(e^*) &= \mathit{true}
\end{aligned}
\]
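The rules for the nullability test translate directly into a structural recursion. The following is a minimal sketch; the tuple encoding of regular expressions and the name `nullable` are assumptions made for illustration and do not come from the paper.

```python
# Regular expressions as nested tuples (an encoding assumed for this sketch):
#   ("0",) empty set, ("e",) epsilon, ("c", a) symbol a,
#   ("+", e1, e2) sum, (".", e1, e2) concatenation, ("*", e) star.

def nullable(e):
    """Characteristic function nu(e): True iff epsilon belongs to L(e)."""
    tag = e[0]
    if tag == "0":
        return False
    if tag == "e":
        return True
    if tag == "c":
        return False
    if tag == "+":
        return nullable(e[1]) or nullable(e[2])
    if tag == ".":
        return nullable(e[1]) and nullable(e[2])
    if tag == "*":
        return True
    raise ValueError(f"unknown constructor: {tag!r}")


a, b, eps = ("c", "a"), ("c", "b"), ("e",)
# (a + eps)(b*a + b)b ends with a mandatory b, so it is not nullable.
e1 = (".", (".", ("+", a, eps), ("+", (".", ("*", b), a), b)), b)
# (a + eps)b* is nullable: both factors can produce the empty string.
e2 = (".", ("+", a, eps), ("*", b))
```

Each case inspects only the immediate subterms, so the test runs in time linear in the size of the expression.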

DEFINITION 4. A deterministic finite automaton (DFA) is a quintuple
$(Q, \Sigma, q_0, t, F)$ where

— $Q$ is a finite set of states;

— $\Sigma$ is the input alphabet;

— $q_0 \in Q$ is the initial state;

— $t : Q \times \Sigma \to Q$ is the state transition function;

— $F \subseteq Q$ is the set of final states.

The transition function t is extended to strings in the following way:

DEFINITION 5. Given a function $t : Q \times \Sigma \to Q$, the function
$t^* : Q \times \Sigma^* \to Q$ is defined as follows:

\[
\begin{aligned}
t^*(q, \epsilon) &= q \\
t^*(q, a w') &= t^*(t(q, a), w')
\end{aligned}
\]

DEFINITION 6. Let $A = (Q, \Sigma, q_0, t, F)$ be a DFA; the language
recognized by $A$ is defined as follows:

\[ L(A) = \{ w \mid t^*(q_0, w) \in F \} \]
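Definitions 5 and 6 amount to folding the transition function over the input. A minimal sketch follows, assuming the transition function is stored as a dictionary keyed by (state, symbol); the names `run`, `accepts` and the example automaton `even_a` are hypothetical.

```python
def run(t, q, w):
    """Extended transition function t*(q, w): apply t once per character."""
    for a in w:
        q = t[(q, a)]
    return q


def accepts(dfa, w):
    """w belongs to L(A) iff t*(q0, w) is a final state."""
    _states, _sigma, q0, t, final = dfa
    return run(t, q0, w) in final


# Hypothetical DFA over {a, b} accepting strings with an even number of a's.
even_a = (
    {0, 1},
    {"a", "b"},
    0,
    {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1},
    {0},
)
```

The loop is the iterative form of the recursion in Definition 5: the empty string leaves the state unchanged, and each leading character is consumed by one application of t.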

3. Pointed regular expressions 

Definition 7.

1. A pointed item over the alphabet $\Sigma$ is an expression $e$
generated by the following grammar:

\[ E ::= \emptyset \mid \epsilon \mid a \mid \bullet a \mid E + E \mid EE \mid E^* \]

with $a \in \Sigma$;

2. A pointed regular expression (pre) is a pair $\langle e, b \rangle$
where $b$ is a boolean and $e$ is a pointed item.

The term $\bullet a$ is used to point to a position inside the regular
expression, preceding the given occurrence of $a$. In a pointed regular
expression, the boolean must be intuitively understood as the
possibility of having a trailing point at the end of the expression.

DEFINITION 8. The carrier $|e|$ of an item $e$ is the regular
expression obtained from $e$ by removing all the points. Similarly, the
carrier of a pointed regular expression is the carrier of its item.

In the sequel, we shall often use the same notation for functions
defined over items or pres, leaving to the reader the simple
disambiguation task. Moreover, we use the notation $\epsilon(b)$,
where $b$ is a boolean, with the following meaning:

\[ \epsilon(\mathit{true}) = \{\epsilon\} \qquad \epsilon(\mathit{false}) = \emptyset \]

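Definition 9.

1. The language $L_p(e)$ associated with the item $e$ is defined by the
following rules:

\[
\begin{aligned}
L_p(\emptyset) &= \emptyset \\
L_p(\epsilon) &= \emptyset \\
L_p(a) &= \emptyset \\
L_p(\bullet a) &= \{a\} \\
L_p(e_1 + e_2) &= L_p(e_1) \cup L_p(e_2) \\
L_p(e_1 e_2) &= L_p(e_1) \cdot L(|e_2|) \cup L_p(e_2) \\
L_p(e^*) &= L_p(e) \cdot L(|e|^*)
\end{aligned}
\]

2. For a pointed regular expression $\langle e, b \rangle$ we define

\[ L_p(\langle e, b \rangle) = L_p(e) \cup \epsilon(b) \]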

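Example 10.

\[ L_p((a + \bullet b)^*) = L(b(a + b)^*) \]

Indeed,

\[
\begin{aligned}
L_p((a + \bullet b)^*) &= L_p(a + \bullet b) \cdot L(|a + \bullet b|^*) \\
&= (L_p(a) \cup L_p(\bullet b)) \cdot L((a + b)^*) \\
&= \{b\} \cdot L((a + b)^*) \\
&= L(b(a + b)^*)
\end{aligned}
\]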

Let us incidentally observe that, as shown by the previous example,
pointed regular expressions can provide a more compact syntax for
denoting languages than traditional regular expressions. This may
have important applications to the investigation of the descriptional
complexity (succinctness) of regular languages (see e.g. [12, 13,
14]).

EXAMPLE 11. If $e$ contains no point (i.e. $e = |e|$) then
$L_p(e) = \emptyset$.

LEMMA 12. If $e$ is a pointed item then $\epsilon \notin L_p(e)$. Hence,
$\epsilon \in L_p(\langle e, b \rangle)$ if and only if $b = \mathit{true}$.

Proof. A trivial structural induction on $e$.

3.1 Broadcasting points 

Intuitively, a regular expression $e$ must be understood as a pointed
expression with a single point in front of it. Since, however, we
only allow points over symbols, we must broadcast this initial point
inside the expression, which essentially corresponds to the
$\epsilon$-closure operation on automata. We use the notation
$\bullet(\cdot)$ to denote such an operation.

The broadcasting operator is also required to lift the item
constructors (choice, concatenation and Kleene's star) from items to
pres: for example, to concatenate a pre $\langle e_1, \mathit{true} \rangle$
with another pre $\langle e_2, b_2 \rangle$, we must first broadcast
the trailing point of the first expression inside $e_2$ and then
prepend $e_1$; similarly for the star operation. We could define first
the broadcasting function $\bullet(\cdot)$ and then the lifted
constructors; however, both the definition and the theory of the
broadcasting function are simplified by making it mutually recursive
with the lifted constructors.
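Definition 13.

1. The function $\bullet(\cdot)$ from pointed items to pres is defined
as follows:

\[
\begin{aligned}
\bullet(\emptyset) &= \langle \emptyset, \mathit{false} \rangle \\
\bullet(\epsilon) &= \langle \epsilon, \mathit{true} \rangle \\
\bullet(a) &= \langle \bullet a, \mathit{false} \rangle \\
\bullet(\bullet a) &= \langle \bullet a, \mathit{false} \rangle \\
\bullet(e_1 + e_2) &= \bullet(e_1) \oplus \bullet(e_2) \\
\bullet(e_1 e_2) &= \bullet(e_1) \odot \langle e_2, \mathit{false} \rangle \\
\bullet(e^*) &= \langle e'^*, \mathit{true} \rangle
  \quad \text{where } \bullet(e) = \langle e', b' \rangle
\end{aligned}
\]

2. The lifted constructors are defined as follows:

\[
\langle e_1', b_1' \rangle \oplus \langle e_2', b_2' \rangle
  = \langle e_1' + e_2', b_1' \vee b_2' \rangle
\]

\[
\langle e_1', b_1' \rangle \odot \langle e_2', b_2' \rangle =
\begin{cases}
\langle e_1' e_2', b_2' \rangle
  & \text{when } b_1' = \mathit{false} \\
\langle e_1' e_2'', b_2' \vee b_2'' \rangle
  & \text{when } b_1' = \mathit{true}
    \text{ and } \bullet(e_2') = \langle e_2'', b_2'' \rangle
\end{cases}
\]

\[
\langle e', b' \rangle^* =
\begin{cases}
\langle e'^*, \mathit{false} \rangle
  & \text{when } b' = \mathit{false} \\
\langle e''^*, \mathit{true} \rangle
  & \text{when } b' = \mathit{true}
    \text{ and } \bullet(e') = \langle e'', b'' \rangle
\end{cases}
\]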

The apparent complexity of the previous definition should not hide
the extreme simplicity of the broadcasting operation: on a sum we
proceed in parallel; on a concatenation $e_1 e_2$, we first work on
$e_1$ and, in case we reach its end, we pursue broadcasting inside
$e_2$; in case of $e^*$ we broadcast the point inside $e$, recalling
that we shall eventually have a trailing point.

EXAMPLE 14. Suppose we broadcast a point inside

\[ (a + \epsilon)(b^* a + b)b \]

We start working in parallel on the first occurrence of $a$ (where the
point stops), and on $\epsilon$, which gets traversed. We have hence
reached the end of $a + \epsilon$ and we must pursue broadcasting inside
$(b^* a + b)b$. Again, we work in parallel on the two additive subterms
$b^* a$ and $b$; the first point is allowed both to enter the star and
to traverse it, stopping in front of $a$; the second point just stops
in front of $b$. No point reached the end of $b^* a + b$, hence no
further propagation is possible. In conclusion:

\[
\bullet((a + \epsilon)(b^* a + b)b) =
\langle (\bullet a + \epsilon)((\bullet b)^* \bullet a + \bullet b)b,
\mathit{false} \rangle
\]
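The broadcasting function and the two lifted constructors it needs can be sketched as mutually recursive functions. The tuple encoding and the names `bullet`, `oplus`, `odot` and `show` below are assumptions made for illustration; the example reproduces the computation of Example 14.

```python
# Items as nested tuples (encoding assumed for this sketch):
#   ("0",), ("e",), ("c", a) for a plain symbol, ("pt", a) for a pointed one,
#   ("+", e1, e2), (".", e1, e2), ("*", e).  A pre is a pair (item, bool).

def bullet(e):
    """Broadcast a point in front of item e, returning a pre."""
    tag = e[0]
    if tag == "0":
        return (e, False)
    if tag == "e":
        return (e, True)
    if tag in ("c", "pt"):
        return (("pt", e[1]), False)
    if tag == "+":
        return oplus(bullet(e[1]), bullet(e[2]))
    if tag == ".":
        return odot(bullet(e[1]), (e[2], False))
    inner, _ = bullet(e[1])              # star: always a trailing point
    return (("*", inner), True)


def oplus(p, q):
    """Lifted sum of two pres."""
    return (("+", p[0], q[0]), p[1] or q[1])


def odot(p, q):
    """Lifted concatenation: a trailing point of p is broadcast inside q."""
    (e1, b1), (e2, b2) = p, q
    if not b1:
        return ((".", e1, e2), b2)
    e2b, b2b = bullet(e2)
    return ((".", e1, e2b), b2 or b2b)


def show(e):
    """Pretty-print an item; sums are always parenthesized."""
    tag = e[0]
    if tag == "0":
        return "0"
    if tag == "e":
        return "ε"
    if tag == "c":
        return e[1]
    if tag == "pt":
        return "•" + e[1]
    if tag == "+":
        return "(" + show(e[1]) + "+" + show(e[2]) + ")"
    if tag == ".":
        return show(e[1]) + show(e[2])
    return "(" + show(e[1]) + ")*"


a, b, eps = ("c", "a"), ("c", "b"), ("e",)
E = (".", (".", ("+", a, eps), ("+", (".", ("*", b), a), b)), b)
item, trailing = bullet(E)
```

Here `show(item)` gives the pointed item of Example 14 and `trailing` is false, since no point reaches the end of the expression.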

DEFINITION 15. The broadcasting function is extended to pres in
the obvious way:

\[
\bullet(\langle e, b \rangle) = \langle e', b \vee b' \rangle
\quad \text{where } \bullet(e) = \langle e', b' \rangle
\]

As we shall prove in Corollary 18, broadcasting an initial point may
reach the end of an expression $e$ if and only if $e$ is nullable.
The following theorem characterizes the broadcasting function and
also shows that the semantics of the lifted constructors on pres is
coherent with the corresponding constructors on items.

Theorem 16.

1. $L_p(\bullet e) = L_p(e) \cup L(|e|)$.

2. $L_p(e_1 \oplus e_2) = L_p(e_1) \cup L_p(e_2)$

3. $L_p(e_1 \odot e_2) = L_p(e_1) \cdot L(|e_2|) \cup L_p(e_2)$

4. $L_p(e^*) = L_p(e) \cdot L(|e|)^*$

We do first the proof of 2., followed by the simultaneous proof of
1. and 3., and we conclude with the proof of 4.



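Proof.[of 2.] We need to prove
$L_p(e_1 \oplus e_2) = L_p(e_1) \cup L_p(e_2)$.

\[
\begin{aligned}
L_p(\langle e_1', b_1' \rangle \oplus \langle e_2', b_2' \rangle)
&= L_p(\langle e_1' + e_2', b_1' \vee b_2' \rangle) \\
&= L_p(e_1' + e_2') \cup \epsilon(b_1') \cup \epsilon(b_2') \\
&= L_p(e_1') \cup \epsilon(b_1') \cup L_p(e_2') \cup \epsilon(b_2') \\
&= L_p(e_1) \cup L_p(e_2)
\end{aligned}
\]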



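Proof.[of 1. and 3.] We prove 1.
($L_p(\bullet e) = L_p(e) \cup L(|e|)$) by induction on the structure
of $e$, assuming that 3. holds on terms structurally smaller than $e$.

— $L_p(\bullet(\emptyset)) = L_p(\langle \emptyset, \mathit{false} \rangle) = \emptyset = L_p(\emptyset) \cup L(|\emptyset|)$.

— $L_p(\bullet(\epsilon)) = L_p(\langle \epsilon, \mathit{true} \rangle) = \{\epsilon\} = L_p(\epsilon) \cup L(|\epsilon|)$.

— $L_p(\bullet(a)) = L_p(\langle \bullet a, \mathit{false} \rangle) = \{a\} = L_p(a) \cup L(|a|)$.

— $L_p(\bullet(\bullet a)) = L_p(\langle \bullet a, \mathit{false} \rangle) = \{a\} = L_p(\bullet a) \cup L(|\bullet a|)$.

— Let $e = e_1 + e_2$. By induction hypothesis we know that

\[ L_p(\bullet(e_i)) = L_p(e_i) \cup L(|e_i|) \]

Thus, by 2., we have

\[
\begin{aligned}
L_p(\bullet(e_1 + e_2))
&= L_p(\bullet(e_1) \oplus \bullet(e_2)) \\
&= L_p(\bullet(e_1)) \cup L_p(\bullet(e_2)) \\
&= L_p(e_1) \cup L(|e_1|) \cup L_p(e_2) \cup L(|e_2|) \\
&= L_p(e_1 + e_2) \cup L(|e_1 + e_2|)
\end{aligned}
\]

— Let $e = e_1 e_2$. By induction hypothesis we know that

\[ L_p(\bullet(e_i)) = L_p(e_i) \cup L(|e_i|) \]

Thus, by 3. over the structurally smaller terms $e_1$ and $e_2$,

\[
\begin{aligned}
L_p(\bullet(e_1 e_2))
&= L_p(\bullet(e_1) \odot \langle e_2, \mathit{false} \rangle) \\
&= L_p(\bullet(e_1)) \cdot L(|e_2|) \cup L_p(e_2) \\
&= (L_p(e_1) \cup L(|e_1|)) \cdot L(|e_2|) \cup L_p(e_2) \\
&= L_p(e_1) \cdot L(|e_2|) \cup L(|e_1|) \cdot L(|e_2|) \cup L_p(e_2) \\
&= L_p(e_1 e_2) \cup L(|e_1 e_2|)
\end{aligned}
\]

— Let $e = e_1^*$ and let $\bullet(e_1) = \langle e_1', b_1' \rangle$.
By induction hypothesis we know that

\[ L_p(\bullet(e_1)) = L_p(e_1') \cup \epsilon(b_1') = L_p(e_1) \cup L(|e_1|) \]

and in particular, since by Lemma 12 $\epsilon \notin L_p(e_1')$,

\[ L_p(e_1') = L_p(e_1) \cup (L(|e_1|) \setminus \epsilon(b_1')) \]

Then,

\[
\begin{aligned}
L_p(\bullet(e_1^*))
&= L_p(\langle e_1'^*, \mathit{true} \rangle) \\
&= L_p(e_1'^*) \cup \{\epsilon\} \\
&= L_p(e_1') \cdot L(|e_1|^*) \cup \{\epsilon\} \\
&= (L_p(e_1) \cup (L(|e_1|) \setminus \epsilon(b_1'))) \cdot L(|e_1|^*) \cup \{\epsilon\} \\
&= L_p(e_1) \cdot L(|e_1|^*) \cup (L(|e_1|) \setminus \epsilon(b_1')) \cdot L(|e_1|^*) \cup \{\epsilon\} \\
&= L_p(e_1) \cdot L(|e_1|^*) \cup L(|e_1|^*) \\
&= L_p(e_1^*) \cup L(|e_1^*|)
\end{aligned}
\]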

Having proved 1. for $e$ assuming that 3. holds on terms structurally
smaller than $e$, we now assume that 1. holds for $e_1$ and $e_2$ in
order to prove 3.:
$L_p(e_1 \odot e_2) = L_p(e_1) \cdot L(|e_2|) \cup L_p(e_2)$.
We distinguish the two cases of the definition of $\odot$:

\[
\begin{aligned}
L_p(\langle e_1', \mathit{false} \rangle \odot \langle e_2', b_2' \rangle)
&= L_p(\langle e_1' e_2', b_2' \rangle) \\
&= L_p(e_1' e_2') \cup \epsilon(b_2') \\
&= L_p(e_1') \cdot L(|e_2'|) \cup L_p(e_2') \cup \epsilon(b_2') \\
&= L_p(e_1) \cdot L(|e_2|) \cup L_p(e_2)
\end{aligned}
\]

\[
\begin{aligned}
L_p(\langle e_1', \mathit{true} \rangle \odot \langle e_2', b_2' \rangle)
&= L_p(\langle e_1' e_2'', b_2' \vee b_2'' \rangle) \\
&= L_p(e_1' e_2'') \cup \epsilon(b_2') \cup \epsilon(b_2'') \\
&= L_p(e_1') \cdot L(|e_2''|) \cup L_p(e_2'') \cup \epsilon(b_2') \cup \epsilon(b_2'') \\
&= L_p(e_1') \cdot L(|e_2''|) \cup L_p(e_2') \cup L(|e_2'|) \cup \epsilon(b_2') \\
&= (L_p(e_1') \cup \epsilon(\mathit{true})) \cdot L(|e_2'|) \cup L_p(e_2') \cup \epsilon(b_2') \\
&= L_p(e_1) \cdot L(|e_2|) \cup L_p(e_2)
\end{aligned}
\]

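Proof.[of 4.] We need to prove $L_p(e^*) = L_p(e) \cdot L(|e|)^*$. We
distinguish the two cases of the definition of $\cdot^*$:

\[
\begin{aligned}
L_p(\langle e', \mathit{false} \rangle^*)
&= L_p(\langle e'^*, \mathit{false} \rangle) \\
&= L_p(e'^*) \\
&= L_p(e') \cdot L(|e'|)^* \\
&= (L_p(e') \cup \epsilon(\mathit{false})) \cdot L(|e'|)^* \\
&= L_p(e) \cdot L(|e|)^*
\end{aligned}
\]

\[
\begin{aligned}
L_p(\langle e', \mathit{true} \rangle^*)
&= L_p(\langle e''^*, \mathit{true} \rangle) \\
&= L_p(e''^*) \cup \{\epsilon\} \\
&= L_p(e'') \cdot L(|e''|)^* \cup \{\epsilon\} \\
&= (L_p(e') \cup L(|e'|)) \cdot L(|e''|)^* \cup \{\epsilon\} \\
&= L_p(e') \cdot L(|e''|)^* \cup L(|e'|) \cdot L(|e''|)^* \cup \{\epsilon\} \\
&= L_p(e') \cdot L(|e''|)^* \cup L(|e'|)^* \\
&= (L_p(e') \cup \epsilon(\mathit{true})) \cdot L(|e''|)^* \\
&= L_p(e) \cdot L(|e|)^*
\end{aligned}
\]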

COROLLARY 17. For any regular expression $e$, $L(e) = L_p(\bullet e)$.

Another important corollary is that an initial point reaches the
end of a (pointed) expression $e$ if and only if $e$ is able to
generate the empty string.

COROLLARY 18. $\bullet e = \langle e', \mathit{true} \rangle$ if and
only if $\epsilon \in L(|e|)$.

Proof. By Theorem 16 we know that $L_p(\bullet e) = L_p(e) \cup L(|e|)$.
So, if $\epsilon \in L_p(\bullet e)$, since by Lemma 12
$\epsilon \notin L_p(e)$, it must be $\epsilon \in L(|e|)$. Conversely,
if $\epsilon \in L(|e|)$ then $\epsilon \in L_p(\bullet e)$; if
$\bullet e = \langle e', b \rangle$, this is possible only provided
$b = \mathit{true}$.

To conclude this section, let us prove the idempotence of the
$\bullet(\cdot)$ function (it will only be used in Section 5 and can be
skipped at a first reading). To this aim we need a technical lemma
whose straightforward proof by case analysis is omitted.

Lemma 19.
1. $\bullet(e_1 \oplus e_2) = \bullet(e_1) \oplus \bullet(e_2)$
2. $\bullet(e_1 \odot e_2) = \bullet(e_1) \odot e_2$

THEOREM 20. $\bullet(\bullet(e)) = \bullet(e)$

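Proof. The proof is by induction on $e$.

— $\bullet(\bullet(\emptyset)) = \bullet(\langle \emptyset, \mathit{false} \rangle) = \langle \emptyset, \mathit{false} \vee \mathit{false} \rangle = \bullet(\emptyset)$

— $\bullet(\bullet(\epsilon)) = \bullet(\langle \epsilon, \mathit{true} \rangle) = \langle \epsilon, \mathit{true} \vee \mathit{true} \rangle = \bullet(\epsilon)$

— $\bullet(\bullet(a)) = \bullet(\langle \bullet a, \mathit{false} \rangle) = \langle \bullet a, \mathit{false} \vee \mathit{false} \rangle = \bullet(a)$

— $\bullet(\bullet(\bullet a)) = \bullet(\langle \bullet a, \mathit{false} \rangle) = \langle \bullet a, \mathit{false} \vee \mathit{false} \rangle = \bullet(\bullet a)$

— If $e$ is $e_1 + e_2$ then

\[
\begin{aligned}
\bullet(\bullet(e_1 + e_2))
&= \bullet(\bullet(e_1) \oplus \bullet(e_2))
 = \bullet(\bullet(e_1)) \oplus \bullet(\bullet(e_2)) \\
&= \bullet(e_1) \oplus \bullet(e_2) = \bullet(e_1 + e_2)
\end{aligned}
\]

— If $e$ is $e_1 e_2$ then

\[
\begin{aligned}
\bullet(\bullet(e_1 e_2))
&= \bullet(\bullet(e_1) \odot \langle e_2, \mathit{false} \rangle)
 = \bullet(\bullet(e_1)) \odot \langle e_2, \mathit{false} \rangle \\
&= \bullet(e_1) \odot \langle e_2, \mathit{false} \rangle = \bullet(e_1 e_2)
\end{aligned}
\]

— If $e$ is $e_1^*$, let $\bullet(e_1) = \langle e', b' \rangle$ and let
$\bullet(e') = \langle e'', b'' \rangle$. By induction hypothesis,

\[
\langle e', b' \rangle = \bullet(e_1) = \bullet(\bullet(e_1))
= \bullet(\langle e', b' \rangle) = \langle e'', b' \vee b'' \rangle
\]

and thus $e' = e''$. Finally

\[
\bullet(\bullet(e_1^*)) = \bullet(\langle e'^*, \mathit{true} \rangle)
= \langle e''^*, \mathit{true} \vee b'' \rangle
= \langle e'^*, \mathit{true} \rangle = \bullet(e_1^*)
\]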

3.2 The move operation 

We now define the move operation, that corresponds to the ad- 
vancement of the state in response to the processing of an input 
character a. The intuition is clear: we have to look at points inside 
e preceding the given character a, let the point traverse the charac- 
ter, and broadcast it. All other points must be removed. 

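Definition 21.

1. The function $\mathit{move}(e, a)$, taking as input a pointed item
$e$ and a character $a \in \Sigma$ and giving back a pointed regular
expression, is defined as follows, by induction on the structure of $e$:

\[
\begin{aligned}
\mathit{move}(\emptyset, a) &= \langle \emptyset, \mathit{false} \rangle \\
\mathit{move}(\epsilon, a) &= \langle \epsilon, \mathit{false} \rangle \\
\mathit{move}(b, a) &= \langle b, \mathit{false} \rangle \\
\mathit{move}(\bullet a, a) &= \langle a, \mathit{true} \rangle \\
\mathit{move}(\bullet b, a) &= \langle b, \mathit{false} \rangle \quad \text{if } b \neq a \\
\mathit{move}(e_1 + e_2, a) &= \mathit{move}(e_1, a) \oplus \mathit{move}(e_2, a) \\
\mathit{move}(e_1 e_2, a) &= \mathit{move}(e_1, a) \odot \mathit{move}(e_2, a) \\
\mathit{move}(e^*, a) &= \mathit{move}(e, a)^*
\end{aligned}
\]

2. The move function is extended to pres by just ignoring the
trailing point:
$\mathit{move}(\langle e, b \rangle, a) = \mathit{move}(e, a)$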

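EXAMPLE 22. Let us consider the pre
$(\bullet a + \epsilon)((\bullet b)^* \bullet a + \bullet b)b$
and the two moves w.r.t. the characters $a$ and $b$. For $a$, we have
two possible positions (all other points get erased); the innermost
point stops in front of the final $b$, the other one broadcasts inside
$(b^* a + b)b$, so

\[
\mathit{move}((\bullet a + \epsilon)((\bullet b)^* \bullet a + \bullet b)b, a)
= \langle (a + \epsilon)((\bullet b)^* \bullet a + \bullet b) \bullet b,
\mathit{false} \rangle
\]

For $b$, we have two positions too. The innermost point still stops in
front of the final $b$, while the other point reaches the end of $b^*$
and must go back through $b^* a$:

\[
\mathit{move}((\bullet a + \epsilon)((\bullet b)^* \bullet a + \bullet b)b, b)
= \langle (a + \epsilon)((\bullet b)^* \bullet a + b) \bullet b,
\mathit{false} \rangle
\]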
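The move operation can be sketched on top of broadcasting and the lifted constructors. The tuple encoding and function names below are assumptions made for illustration; the two moves computed at the end are those of Example 22, both applied to the same starting pre.

```python
# Items as nested tuples (encoding assumed for this sketch):
#   ("0",), ("e",), ("c", a), ("pt", a) for a pointed symbol,
#   ("+", e1, e2), (".", e1, e2), ("*", e).  A pre is a pair (item, bool).

def bullet(e):
    tag = e[0]
    if tag in ("0", "e"):
        return (e, tag == "e")
    if tag in ("c", "pt"):
        return (("pt", e[1]), False)
    if tag == "+":
        return oplus(bullet(e[1]), bullet(e[2]))
    if tag == ".":
        return odot(bullet(e[1]), (e[2], False))
    return (("*", bullet(e[1])[0]), True)


def oplus(p, q):
    return (("+", p[0], q[0]), p[1] or q[1])


def odot(p, q):
    if not p[1]:                       # no trailing point on the left
        return ((".", p[0], q[0]), q[1])
    e2, b2 = bullet(q[0])              # broadcast it inside the right pre
    return ((".", p[0], e2), q[1] or b2)


def ostar(p):
    if not p[1]:
        return (("*", p[0]), False)
    return (("*", bullet(p[0])[0]), True)


def move(e, a):
    """Advance every point in front of an `a`; erase all other points."""
    tag = e[0]
    if tag in ("0", "e", "c"):
        return (e, False)
    if tag == "pt":
        return (("c", e[1]), e[1] == a)
    if tag == "+":
        return oplus(move(e[1], a), move(e[2], a))
    if tag == ".":
        return odot(move(e[1], a), move(e[2], a))
    return ostar(move(e[1], a))


a, b, eps = ("c", "a"), ("c", "b"), ("e",)
pa, pb = ("pt", "a"), ("pt", "b")
# The pre of Example 22: (•a + ε)((•b)*•a + •b)b
P = (".", (".", ("+", pa, eps), ("+", (".", ("*", pb), pa), pb)), b)
after_a = move(P, "a")
after_b = move(P, "b")
```

A single traversal of the item suffices: the boolean returned by each recursive call tells the enclosing constructor whether a point has reached the end of the subterm and must be broadcast further.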

THEOREM 23. For any pointed regular expression $e$ and string $w$,

\[ w \in L_p(\mathit{move}(e, a)) \iff aw \in L_p(e) \]

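Proof. The proof is by induction on the structure of $e$.

— if $e$ is atomic, and $e$ is not a pointed symbol, then both
$L_p(\mathit{move}(e, a))$ and $L_p(e)$ are empty, and hence both sides
are false for any $w$;

— if $e = \bullet a$ then
$L_p(\mathit{move}(\bullet a, a)) = L_p(\langle a, \mathit{true} \rangle) = \{\epsilon\}$
and $L_p(\bullet a) = \{a\}$;

— if $e = \bullet b$ with $b \neq a$ then
$L_p(\mathit{move}(\bullet b, a)) = L_p(\langle b, \mathit{false} \rangle) = \emptyset$
and $L_p(\bullet b) = \{b\}$; hence for any string $w$, both sides are
false;

— if $e = e_1 + e_2$, by induction hypothesis
$w \in L_p(\mathit{move}(e_i, a)) \iff aw \in L_p(e_i)$, hence,

\[
\begin{aligned}
& w \in L_p(\mathit{move}(e_1 + e_2, a)) \\
\iff\ & w \in L_p(\mathit{move}(e_1, a) \oplus \mathit{move}(e_2, a)) \\
\iff\ & w \in L_p(\mathit{move}(e_1, a)) \cup L_p(\mathit{move}(e_2, a)) \\
\iff\ & (w \in L_p(\mathit{move}(e_1, a))) \vee (w \in L_p(\mathit{move}(e_2, a))) \\
\iff\ & (aw \in L_p(e_1)) \vee (aw \in L_p(e_2)) \\
\iff\ & aw \in L_p(e_1) \cup L_p(e_2) \\
\iff\ & aw \in L_p(e_1 + e_2)
\end{aligned}
\]

— suppose $e = e_1 e_2$; by induction hypothesis
$w \in L_p(\mathit{move}(e_i, a)) \iff aw \in L_p(e_i)$, hence,

\[
\begin{aligned}
& w \in L_p(\mathit{move}(e_1 e_2, a)) \\
\iff\ & w \in L_p(\mathit{move}(e_1, a) \odot \mathit{move}(e_2, a)) \\
\iff\ & w \in L_p(\mathit{move}(e_1, a)) \cdot L(|e_2|) \cup L_p(\mathit{move}(e_2, a)) \\
\iff\ & w \in L_p(\mathit{move}(e_1, a)) \cdot L(|e_2|) \vee w \in L_p(\mathit{move}(e_2, a)) \\
\iff\ & (\exists w_1, w_2.\ w = w_1 w_2 \wedge w_1 \in L_p(\mathit{move}(e_1, a))
        \wedge w_2 \in L(|e_2|)) \vee w \in L_p(\mathit{move}(e_2, a)) \\
\iff\ & (\exists w_1, w_2.\ w = w_1 w_2 \wedge a w_1 \in L_p(e_1)
        \wedge w_2 \in L(|e_2|)) \vee aw \in L_p(e_2) \\
\iff\ & (aw \in L_p(e_1) \cdot L(|e_2|)) \vee (aw \in L_p(e_2)) \\
\iff\ & aw \in L_p(e_1) \cdot L(|e_2|) \cup L_p(e_2) \\
\iff\ & aw \in L_p(e_1 e_2)
\end{aligned}
\]

— suppose $e = e_1^*$; by induction hypothesis
$w \in L_p(\mathit{move}(e_1, a)) \iff aw \in L_p(e_1)$, hence,

\[
\begin{aligned}
& w \in L_p(\mathit{move}(e_1^*, a)) \\
\iff\ & w \in L_p(\mathit{move}(e_1, a)^*) \\
\iff\ & w \in L_p(\mathit{move}(e_1, a)) \cdot L(|\mathit{move}(e_1, a)|)^* \\
\iff\ & \exists w_1, w_2.\ w = w_1 w_2 \wedge w_1 \in L_p(\mathit{move}(e_1, a))
        \wedge w_2 \in L(|e_1|)^* \\
\iff\ & \exists w_1, w_2.\ w = w_1 w_2 \wedge a w_1 \in L_p(e_1)
        \wedge w_2 \in L(|e_1|)^* \\
\iff\ & aw \in L_p(e_1) \cdot L(|e_1|)^* \\
\iff\ & aw \in L_p(e_1^*)
\end{aligned}
\]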

We extend the move operation to strings as usual.

Definition 24.

\[
\mathit{move}^*(e, \epsilon) = e \qquad
\mathit{move}^*(e, aw) = \mathit{move}^*(\mathit{move}(e, a), w)
\]

THEOREM 25. For any pointed regular expression $e$ and all strings
$\alpha$, $\beta$,

\[ \beta \in L_p(\mathit{move}^*(e, \alpha)) \iff \alpha\beta \in L_p(e) \]

Proof. A trivial induction on the length of $\alpha$, using Theorem 23.

COROLLARY 26. For any pointed regular expression $e$ and any
string $\alpha$,

\[
\alpha \in L_p(e) \iff \exists e'.\ \mathit{move}^*(e, \alpha)
= \langle e', \mathit{true} \rangle
\]

Proof. By Theorem 25 and Lemma 12.

3.3 From regular expressions to DFAs

DEFINITION 27. To any regular expression $e$ we may associate a
DFA $D_e = (Q, \Sigma, q_0, t, F)$ defined in the following way:

— $Q$ is the set of all possible pointed expressions having $e$ as
carrier;

— $\Sigma$ is the alphabet of the regular expression;

— $q_0$ is $\bullet e$;

— $t$ is the move operation of Definition 21;

— $F$ is the subset of pointed expressions $\langle e, b \rangle$ with
$b = \mathit{true}$.

Theorem 28. $L(D_e) = L(e)$

Proof. By definition,

\[
w \in L(D_e) \iff \mathit{move}^*(\bullet(e), w) = \langle e', \mathit{true} \rangle
\]

for some $e'$. By the previous corollary, this is possible if and only
if $w \in L_p(\bullet(e))$, and by Corollary 17
$L_p(\bullet(e)) = L(e)$.



REMARK 29. The fact that the set $Q$ of states of $D_e$ is finite is
obvious: its cardinality is at most $2^{n+1}$ where $n$ is the number
of symbols in $e$. This is one of the advantages of pointed regular
expressions w.r.t. derivatives, whose finite nature only holds after
a suitable quotient, and is a relatively complex property to prove
(see [3]).
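As a rough illustration of this bound, and assuming that "number of symbols" counts occurrences of alphabet symbols, take the expression $(a + \epsilon)(b^* a + b)b$ used in the examples above. It has five symbol occurrences ($a$, $b$, $a$, $b$, $b$), each of which may or may not carry a point, plus the trailing boolean:

```latex
\[
|Q| \;\le\; 2^{n+1} \;=\; 2^{5+1} \;=\; 64
\]
```

Only a small fraction of these 64 pres is reachable from the initial state, which motivates the construction restricted to accessible states.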

The automaton $D_e$ just defined may have many inaccessible states.
We can provide another algorithmic and direct construction that
yields the same automaton restricted to the accessible states only.

DEFINITION 30. Let $e$ be a regular expression and let $q_0$ be
$\bullet e$. Let also

\[
\begin{aligned}
Q_0 &:= \{q_0\} \\
Q_{n+1} &:= Q_n \cup \{ e' \mid e' \notin Q_n \wedge
  \exists a. \exists e \in Q_n.\ \mathit{move}(e, a) = e' \}
\end{aligned}
\]

Since every $Q_n$ is a subset of the finite set of pointed regular
expressions, there is an $m$ such that $Q_{m+1} = Q_m$. We associate
to $e$ the DFA $D_e = (Q_m, \Sigma, q_0, t, F)$ where $F$ and $t$ are
defined as for the previous construction.
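The fixpoint of Definition 30 can be computed with a worklist. The sketch below is one possible rendering: the tuple encoding, the function names and the cross-check against Python's `re` module are assumptions made for illustration, not part of the paper.

```python
import itertools
import re

# Encoding assumed for this sketch: ("0",), ("e",), ("c", a), ("pt", a) for a
# pointed symbol, ("+", e1, e2), (".", e1, e2), ("*", e); a pre is (item, bool).

def bullet(e):
    tag = e[0]
    if tag in ("0", "e"):
        return (e, tag == "e")
    if tag in ("c", "pt"):
        return (("pt", e[1]), False)
    if tag == "+":
        return oplus(bullet(e[1]), bullet(e[2]))
    if tag == ".":
        return odot(bullet(e[1]), (e[2], False))
    return (("*", bullet(e[1])[0]), True)

def oplus(p, q):
    return (("+", p[0], q[0]), p[1] or q[1])

def odot(p, q):
    if not p[1]:
        return ((".", p[0], q[0]), q[1])
    e2, b2 = bullet(q[0])
    return ((".", p[0], e2), q[1] or b2)

def ostar(p):
    return (("*", bullet(p[0])[0] if p[1] else p[0]), p[1])

def move(e, a):
    tag = e[0]
    if tag in ("0", "e", "c"):
        return (e, False)
    if tag == "pt":
        return (("c", e[1]), e[1] == a)
    if tag == "+":
        return oplus(move(e[1], a), move(e[2], a))
    if tag == ".":
        return odot(move(e[1], a), move(e[2], a))
    return ostar(move(e[1], a))

def symbols(e):
    """Alphabet symbols occurring in e."""
    if e[0] in ("c", "pt"):
        return {e[1]}
    return set().union(*(symbols(x) for x in e[1:]))

def build_dfa(e):
    """Accessible part of D_e: states are pres, final iff the boolean is true."""
    q0 = bullet(e)
    sigma = sorted(symbols(e))
    states, trans, todo = {q0}, {}, [q0]
    while todo:
        q = todo.pop()
        for a in sigma:
            q2 = move(q[0], a)
            trans[(q, a)] = q2
            if q2 not in states:
                states.add(q2)
                todo.append(q2)
    return states, sigma, q0, trans, {q for q in states if q[1]}

def accepts(dfa, w):
    _, _, q, trans, final = dfa
    for a in w:
        q = trans[(q, a)]
    return q in final

a, b, eps = ("c", "a"), ("c", "b"), ("e",)
E = (".", (".", ("+", a, eps), ("+", (".", ("*", b), a), b)), b)
dfa = build_dfa(E)
words = ["".join(t) for n in range(6) for t in itertools.product("ab", repeat=n)]
# Strings on which the DFA and a backtracking matcher disagree (expected: none).
mismatches = [w for w in words
              if accepts(dfa, w) != bool(re.fullmatch(r"(a|)(b*a|b)b", w))]
```

Because pres are plain nested tuples they are hashable, so the set of visited states doubles as the membership test required by the definition of $Q_{n+1}$.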




Figure 1. DFA for $(a + \epsilon)(b^* a + b)b$

In Figure 1 we describe the DFA associated with the regular
expression $(a + \epsilon)(b^* a + b)b$. The graphical description of
the automaton is the traditional one, with nodes for states and
labelled arcs for transitions. Unreachable states are not shown. Final
states are emphasized by a double circle: since a state
$\langle e, b \rangle$ is final if and only if $b$ is true, we may just
label nodes with the item (for instance, the pairs of states 6, 8 and
7, 9 only differ in the fact that 6 and 7 are final, while 8 and 9 are
not).

3.4 Admissible relations and minimization 

The automaton in Figure 1 is minimal. This is not always the case.
For instance, for the expression $(ac + bc)^*$ we obtain the automaton
of Figure 2, and it is easy to see that the two states corresponding
to the pres $(a \bullet c + bc)^*$ and $(ac + b \bullet c)^*$ are
equivalent (a way to prove it is to observe that they define the same
language).
The latter remark motivates the following definition.
Figure 2. DFA for $(ac + bc)^*$



DEFINITION 31. An equivalence relation $\approx$ over pres having the
same carrier is admissible when for all $e_1$ and $e_2$

— if $e_1 \approx e_2$ then $L_p(e_1) = L_p(e_2)$

— if $e_1 \approx e_2$ then for all $a$,
$\mathit{move}(e_1, a) \approx \mathit{move}(e_2, a)$

DEFINITION 32. To any regular expression $e$ and admissible
equivalence relation $\approx$ over pres over $e$, we can directly
associate the DFA
$D_e/{\approx} = (Q/{\approx}, \Sigma, [q_0]_{/\approx}, \mathit{move}^*/{\approx}, F/{\approx})$
where $\mathit{move}^*/{\approx}$ is the $\mathit{move}^*$ operation
lifted to equivalence classes thanks to the second admissibility
condition.

Instead of working with equivalence classes, for formalization and
implementation purposes it is simpler to work on representatives of
equivalence classes. Instead of choosing a priori a representative
of each equivalence class, we can slightly modify the algorithmic
construction of Definition 30 so that it dynamically identifies the
representatives of the equivalence classes. It is sufficient to read
each element of $Q_n$ as a representative of its equivalence class and
to change the test $e' \notin Q_n$ so that the new state $e'$ is
compared to the representatives in $Q_n$ up to $\approx$:

DEFINITION 33. In Definition 30 change the definition of $Q_{n+1}$ as
follows:

\[
Q_{n+1} := Q_n \cup \{ e' \mid \exists a. \exists e \in Q_n.\
  \mathit{move}(e, a) = e' \wedge \neg \exists e'' \in Q_n.\ e' \approx e'' \}
\]

The transition function $t$ is defined as $t(e, a) = e'$ where
$\mathit{move}(e, a) = e''$ and $e'$ is the unique state of $Q_m$ such
that $e' \approx e''$.

In an actual implementation, the transition function t is computed 
together with the sets Q n at no additional cost. 
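The modified construction can be sketched by indexing representatives through a key function, so that two pres are identified exactly when their keys coincide. As an illustrative relation the sketch uses equality of the read-back sets $R$ of Definition 37; that this relation is admissible is an assumption here, not something established in the text above. All names and the tuple encoding are likewise hypothetical.

```python
import itertools
import re

# Encoding assumed for this sketch: ("0",), ("e",), ("c", a), ("pt", a) for a
# pointed symbol, ("+", e1, e2), (".", e1, e2), ("*", e); a pre is (item, bool).

def bullet(e):
    tag = e[0]
    if tag in ("0", "e"):
        return (e, tag == "e")
    if tag in ("c", "pt"):
        return (("pt", e[1]), False)
    if tag == "+":
        return oplus(bullet(e[1]), bullet(e[2]))
    if tag == ".":
        return odot(bullet(e[1]), (e[2], False))
    return (("*", bullet(e[1])[0]), True)

def oplus(p, q):
    return (("+", p[0], q[0]), p[1] or q[1])

def odot(p, q):
    if not p[1]:
        return ((".", p[0], q[0]), q[1])
    e2, b2 = bullet(q[0])
    return ((".", p[0], e2), q[1] or b2)

def ostar(p):
    return (("*", bullet(p[0])[0] if p[1] else p[0]), p[1])

def move(e, a):
    tag = e[0]
    if tag in ("0", "e", "c"):
        return (e, False)
    if tag == "pt":
        return (("c", e[1]), e[1] == a)
    if tag == "+":
        return oplus(move(e[1], a), move(e[2], a))
    if tag == ".":
        return odot(move(e[1], a), move(e[2], a))
    return ostar(move(e[1], a))

def carrier(e):
    if e[0] == "pt":
        return ("c", e[1])
    return (e[0],) + tuple(carrier(x) if isinstance(x, tuple) else x for x in e[1:])

def readback(p):
    """R of Definition 37 on a pre, as a frozenset of plain regexes."""
    e, b = p
    tag = e[0]
    if tag == "pt":
        out = {("c", e[1])}
    elif tag == "+":
        out = readback((e[1], False)) | readback((e[2], False))
    elif tag == ".":
        out = {(".", r, carrier(e[2])) for r in readback((e[1], False))}
        out |= readback((e[2], False))
    elif tag == "*":
        out = {(".", r, carrier(e)) for r in readback((e[1], False))}
    else:
        out = set()
    return frozenset(out) | (frozenset({("e",)}) if b else frozenset())

def symbols(e):
    if e[0] in ("c", "pt"):
        return {e[1]}
    return set().union(*(symbols(x) for x in e[1:]))

def build_dfa(e, key=lambda q: q):
    """Worklist construction; states with equal keys share one representative."""
    q0 = bullet(e)
    sigma = sorted(symbols(e))
    reps, trans, todo = {key(q0): q0}, {}, [q0]
    while todo:
        q = todo.pop()
        for a in sigma:
            q2 = move(q[0], a)
            if key(q2) not in reps:
                reps[key(q2)] = q2
                todo.append(q2)
            trans[(q, a)] = reps[key(q2)]   # transition goes to the representative
    return set(reps.values()), sigma, q0, trans

def accepts(dfa, w):
    _, _, q, trans = dfa
    for a in w:
        q = trans[(q, a)]
    return q[1]

a, b, c = ("c", "a"), ("c", "b"), ("c", "c")
E = ("*", ("+", (".", a, c), (".", b, c)))      # (ac + bc)*
plain = build_dfa(E)                            # syntactic identity
quot = build_dfa(E, key=readback)               # up to equal read-back sets
words = ["".join(t) for n in range(5) for t in itertools.product("abc", repeat=n)]
mismatches = [w for w in words
              if accepts(quot, w) != bool(re.fullmatch(r"(ac|bc)*", w))]
```

With the identity key the construction yields four states for $(ac + bc)^*$; with the read-back key the two states $(a \bullet c + bc)^*$ and $(ac + b \bullet c)^*$ collapse into one, leaving three. The transitions are recorded in the same loop that discovers the states.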

THEOREM 34. Replacing each state $e$ of the automaton of Definition 33
with $[e]_{/\approx}$, we obtain the restriction of the automaton of
Definition 32 to the accessible states.

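We still need to prove that quotienting over $\approx$ does not change
the language recognized by the automaton.

Theorem 35. $L(D_e/{\approx}) = L(e)$

Proof. By Theorem 28 it is sufficient to prove
$L(D_e) = L(D_e/{\approx})$ or, equivalently, that for all $w$,
$\mathit{move}^*/{\approx}([q_0]_{/\approx}, w) \in F/{\approx} \iff
\mathit{move}^*(q_0, w) \in F$. We show this to hold by proving by
induction over $w$ that for all $q$

\[
[\mathit{move}^*(q, w)]_{/\approx} = \mathit{move}^*/{\approx}([q]_{/\approx}, w)
\]

Base case:
$\mathit{move}^*/{\approx}([q]_{/\approx}, \epsilon) = [q]_{/\approx}
= [\mathit{move}^*(q, \epsilon)]_{/\approx}$

Inductive step: by condition (2) of admissibility, for all
$q_i \in [q_0]_{/\approx}$, we have
$\mathit{move}(q_i, a) \approx \mathit{move}(q_0, a)$ and thus
$\mathit{move}/{\approx}([q_0]_{/\approx}, a) = [\mathit{move}(q_0, a)]_{/\approx}$.
Hence

\[
\begin{aligned}
\mathit{move}^*/{\approx}([q_0]_{/\approx}, aw)
&= \mathit{move}^*/{\approx}(\mathit{move}/{\approx}([q_0]_{/\approx}, a), w) \\
&= \mathit{move}^*/{\approx}([\mathit{move}(q_0, a)]_{/\approx}, w) \\
&= [\mathit{move}^*(\mathit{move}(q_0, a), w)]_{/\approx} \\
&= [\mathit{move}^*(q_0, aw)]_{/\approx}
\end{aligned}
\]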



The set of admissible equivalence relations over $e$ is a bounded
lattice, ordered by refinement, whose bottom element is syntactic
identity and whose top element is $e_1 \approx e_2$ iff
$L_p(e_1) = L_p(e_2)$. Moreover, if ${\approx_1} < {\approx_2}$ (the
first relation is a strict refinement of the second one), the number
of states of $D_e/{\approx_1}$ is strictly larger than the number of
states of $D_e/{\approx_2}$.

THEOREM 36. If $\approx$ is the top element of the lattice, then
$D_e/{\approx}$ is the minimal automaton that recognizes $L(e)$.

Proof. By the previous theorem, $D_e/{\approx}$ recognizes $L(e)$ and
has no unreachable states. By contradiction, let
$D' = (Q', \Sigma', q_0', t', F')$ be another, smaller automaton that
recognizes $L(e)$. Since the two automata are different, recognize the
same language and have no unreachable states, there exist two words
$w_1$, $w_2$ such that $t'(q_0', w_1) = t'(q_0', w_2)$ but
$[e_1]_{/\approx} = \mathit{move}^*/{\approx}([q_0]_{/\approx}, w_1) \neq
\mathit{move}^*/{\approx}([q_0]_{/\approx}, w_2) = [e_2]_{/\approx}$,
where $e_1$ and $e_2$ are any two representatives of their equivalence
classes, and thus $e_1 \not\approx e_2$. By definition of $\approx$,
$L_p(e_1) \neq L_p(e_2)$. Without loss of generality, let
$w_3 \in L_p(e_1) \setminus L_p(e_2)$. We have $w_1 w_3 \in L(e)$ and
$w_2 w_3 \notin L(e)$ because $D_e/{\approx}$ recognizes $L(e)$, which
is absurd since $t'(q_0', w_1 w_3) = t'(q_0', w_2 w_3)$ and $D'$ also
recognizes $L(e)$.

The previous theorem tells us that it is possible to associate to
each state of an automaton for $e$ (and in particular to the minimal
automaton) a pre $e'$ over $e$ so that the language recognized by the
automaton in the state $e'$ is $L_p(e')$, which provides a very
suggestive labelling of states.

The characterization of the minimal automaton we just gave does not
seem to entail an original algorithmic construction, since it does not
suggest any new effective way of computing $\approx$. However,
similarly to what has been done for derivatives (where we have similar
problems), it is interesting to investigate admissible relations that
are easier to compute and tend to produce small automata in most
practical cases. In particular, in the next section, we shall
investigate one important relation providing a common quotient between
the automata built with pres and with Brzozowski's derivatives.



4. Read back 

Intuitively, a pointed regular expression corresponds to a set of 
regular expressions. In this section we shall formally investigate 
this "read back" function; this will allow us to establish a more 
syntactic relation between traditional regular expressions and their 
pointed version, and to compare our technique for building a DFA 
with that based on derivatives. 

In the following sections we shall frequently deal with sets of 
regular expressions (to be understood additively), that we prefer to 
the treatment of regular expressions up to associativity, commuta- 
tivity and idempotence of the sum (ACI) that is for instance typical 
of the traditional theory of derivatives (this also clarifies that ACI- 
rewriting is only used at the top level). 

It is hence useful to extend some syntactic operations, and
especially concatenation, to sets of regular expressions, with the
usual distributive meaning: if $e$ is a regular expression and $S$ is
a set of regular expressions, then

\[ Se = \{ re \mid r \in S \} \]

We define $eS$ and $S_1 S_2$ in a similar way. Moreover, every function
on regular expressions is implicitly lifted to sets of regular
expressions by taking its image. For example,

\[ L(S) = \bigcup_{e \in S} L(e) \]
DEFINITION 37. We associate to each item $e$ a set of regular
expressions $R(e)$ defined by the following rules:

\[
\begin{aligned}
R(\emptyset) &= \emptyset \\
R(\epsilon) &= \emptyset \\
R(a) &= \emptyset \\
R(\bullet a) &= \{a\} \\
R(e_1 + e_2) &= R(e_1) \cup R(e_2) \\
R(e_1 e_2) &= R(e_1)|e_2| \cup R(e_2) \\
R(e^*) &= R(e)|e|^*
\end{aligned}
\]

$R$ is extended to a pointed regular expression $\langle e, b \rangle$
as follows:

\[ R(\langle e, b \rangle) = R(e) \cup \epsilon(b) \]

Note that, for any item $e$, no regular expression in $R(e)$ is
nullable.

EXAMPLE 38. Since
$\bullet((a + \epsilon)b^*) = \langle (\bullet a + \epsilon)(\bullet b)^*, \mathit{true} \rangle$
we have $R(\bullet((a + \epsilon)b^*)) = \{ab^*, bb^*, \epsilon\}$.
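The read-back function follows the same recursion scheme as $L_p$, but produces regular expressions instead of languages. The following is a minimal sketch; the tuple encoding and the names `carrier`, `readback` and `readback_pre` are assumptions for illustration. The final lines recompute Example 38 starting from the already broadcast pre.

```python
# Encoding assumed for this sketch: ("0",), ("e",), ("c", a), ("pt", a) for a
# pointed symbol, ("+", e1, e2), (".", e1, e2), ("*", e); a pre is (item, bool).

def carrier(e):
    """|e|: the item with all points removed."""
    if e[0] == "pt":
        return ("c", e[1])
    return (e[0],) + tuple(carrier(x) if isinstance(x, tuple) else x for x in e[1:])


def readback(e):
    """R(e) for an item e, as a frozenset of plain regular expressions."""
    tag = e[0]
    if tag in ("0", "e", "c"):
        return frozenset()
    if tag == "pt":
        return frozenset({("c", e[1])})
    if tag == "+":
        return readback(e[1]) | readback(e[2])
    if tag == ".":
        right = carrier(e[2])
        return frozenset((".", r, right) for r in readback(e[1])) | readback(e[2])
    star = ("*", carrier(e[1]))                       # |e|*
    return frozenset((".", r, star) for r in readback(e[1]))


def readback_pre(p):
    """R extended to pres: add epsilon when the trailing boolean is true."""
    item, b = p
    return readback(item) | (frozenset({("e",)}) if b else frozenset())


a, b, eps = ("c", "a"), ("c", "b"), ("e",)
pa, pb = ("pt", "a"), ("pt", "b")
# The pre <(•a + ε)(•b)*, true> obtained in Example 38.
pre = ((".", ("+", pa, eps), ("*", pb)), True)
result = readback_pre(pre)
```

Each pointed symbol contributes exactly one expression, namely the symbol followed by the carriers of everything that remains to be read after it.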

The parallel between the syntactic read-back function $R$ and
the semantics $L_p$ of Definition 9 is clear by inspection of the
rules. Hence the following lemma can be proved by a trivial induction
over $e$.

Lemma 39. $L(R(e)) = L_p(e)$

COROLLARY 40. For any regular expression $e$,
$L(R(\bullet(e))) = L(e)$

The previous corollary states that $R$ and $\bullet(\cdot)$ are
semantically inverse functions. Syntactically, they associate to each
expression $e$ an interesting "look-ahead" normal form, constituted (up
to associativity of concatenation) by a set of expressions of the kind
$a e_a$ (plus $\epsilon$ if $e$ is nullable), where $e_a$ is a
derivative of $e$ w.r.t. $a$ (although syntactically different from
Brzozowski's derivatives, defined in the next section).

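This look-ahead normal form ($\mathit{nf}$) is of interest in its own
right, and can be simply defined by structural induction over $e$.

Definition 41.

\[
\begin{aligned}
\mathit{nf}(\emptyset) &= \emptyset \\
\mathit{nf}(\epsilon) &= \emptyset \\
\mathit{nf}(a) &= \{a\} \\
\mathit{nf}(e_1 + e_2) &= \mathit{nf}(e_1) \cup \mathit{nf}(e_2) \\
\mathit{nf}(e_1 e_2) &= \mathit{nf}(e_1) e_2 \quad \text{if } \nu(e_1) = \mathit{false} \\
\mathit{nf}(e_1 e_2) &= \mathit{nf}(e_1) e_2 \cup \mathit{nf}(e_2) \quad \text{if } \nu(e_1) = \mathit{true} \\
\mathit{nf}(e^*) &= \mathit{nf}(e) e^*
\end{aligned}
\]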

REMARK 42. It is easy to prove that, for each $e$, the set
$\mathit{nf}(e)$ is made, up to associativity of concatenation, only of
expressions of the form $a$ or $a e_a$. In particular, no expression in
$\mathit{nf}(e)$ is nullable.

The previous remark motivates the following definition.

Definition 43. $\mathit{nf}_\epsilon(e) = \mathit{nf}(e) \cup \epsilon(\nu(|e|))$
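Definitions 41 and 43 can be sketched as follows, reusing the nullability test of Definition 3. The tuple encoding and the names `nf` and `nf_eps` are assumptions made for illustration. The example recomputes the set of Example 38 directly from the unpointed expression $(a + \epsilon)b^*$.

```python
# Regular expressions as nested tuples (encoding assumed for this sketch):
#   ("0",), ("e",), ("c", a), ("+", e1, e2), (".", e1, e2), ("*", e).

def nullable(e):
    tag = e[0]
    if tag in ("0", "c"):
        return False
    if tag in ("e", "*"):
        return True
    if tag == "+":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])      # concatenation


def nf(e):
    """Look-ahead normal form of e, as a frozenset of regular expressions."""
    tag = e[0]
    if tag in ("0", "e"):
        return frozenset()
    if tag == "c":
        return frozenset({e})
    if tag == "+":
        return nf(e[1]) | nf(e[2])
    if tag == ".":
        left = frozenset((".", r, e[2]) for r in nf(e[1]))
        return left | nf(e[2]) if nullable(e[1]) else left
    return frozenset((".", r, e) for r in nf(e[1]))   # nf(e1) followed by e1*


def nf_eps(e):
    """nf extended with epsilon when e is nullable."""
    return nf(e) | (frozenset({("e",)}) if nullable(e) else frozenset())


a, b, eps = ("c", "a"), ("c", "b"), ("e",)
E = (".", ("+", a, eps), ("*", b))                    # (a + ε)b*
result = nf_eps(E)
```

Here `result` contains $ab^*$, $bb^*$ and $\epsilon$, the same three expressions obtained by broadcasting and reading back in Example 38.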

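The main properties of $\mathit{nf}_\epsilon$ are expressed by the
following two lemmas, whose simple proof is left to the reader.

Lemma 44.

\[
\begin{aligned}
\mathit{nf}_\epsilon(\emptyset) &= \emptyset \\
\mathit{nf}_\epsilon(\epsilon) &= \{\epsilon\} \\
\mathit{nf}_\epsilon(a) &= \{a\} \\
\mathit{nf}_\epsilon(e_1 + e_2) &= \mathit{nf}_\epsilon(e_1) \cup \mathit{nf}_\epsilon(e_2) \\
\mathit{nf}_\epsilon(e_1 e_2) &= \mathit{nf}_\epsilon(e_1) e_2 \quad \text{if } \nu(e_1) = \mathit{false} \\
\mathit{nf}_\epsilon(e_1 e_2) &= \mathit{nf}(e_1) e_2 \cup \mathit{nf}_\epsilon(e_2) \quad \text{if } \nu(e_1) = \mathit{true} \\
\mathit{nf}_\epsilon(e^*) &= \mathit{nf}(e) e^* \cup \epsilon(\nu(e^*))
\end{aligned}
\]

Theorem 45. $L(e) = L(\mathit{nf}_\epsilon(e))$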



THEOREM 46. For any pointed regular expression e, 
R(.(e)) = nf c (\e\)uR(e) 

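Proof. Let $\bullet(e) = \langle e', b'\rangle$; then $\epsilon \in R(\bullet(e))$ iff $b' = true$, iff
$\nu(|e|) = true$. Hence the goal reduces to proving that
$R(e') = nf(|e|) \cup R(e)$. We proceed by induction on the structure of $e$.

- $e = \emptyset$: $\bullet(\emptyset) = \langle \emptyset, false\rangle$ and $R(\emptyset) = \emptyset = nf(\emptyset)$

- $e = \epsilon$: $\bullet(\epsilon) = \langle \epsilon, true\rangle$ and $R(\epsilon) = \emptyset = nf(\epsilon)$

- $e = a$: $\bullet(a) = \langle \bullet a, false\rangle$ and $R(\bullet a) = \{a\} = nf(a)$

- $e = \bullet a$: $\bullet(\bullet a) = \langle \bullet a, false\rangle$ and
  $R(\bullet a) = \{a\} = nf(a) = nf(|\bullet a|) = nf(|\bullet a|) \cup R(\bullet a)$

- $e = e_1 + e_2$: let $\bullet(e_1 + e_2) = \langle e_1' + e_2', b\rangle$; then

\[
\begin{aligned}
R(e_1' + e_2') &= R(e_1') \cup R(e_2') \\
&= nf(|e_1|) \cup R(e_1) \cup nf(|e_2|) \cup R(e_2) \\
&= nf(|e_1 + e_2|) \cup R(e_1 + e_2)
\end{aligned}
\]

- $e = e_1 e_2$. Let $\bullet(e_i) = \langle e_i', b_i'\rangle$. If $b_1' = false$ then
  $\bullet(e_1 e_2) = \langle e_1' e_2, false\rangle$; moreover we know that $e_1$ is not
  nullable. We have then:

\[
\begin{aligned}
R(e_1' e_2) &= R(e_1')|e_2| \cup R(e_2) \\
&= (nf(|e_1|) \cup R(e_1))|e_2| \cup R(e_2) \\
&= nf(|e_1|)|e_2| \cup R(e_1)|e_2| \cup R(e_2) \\
&= nf(|e_1 e_2|) \cup R(e_1 e_2)
\end{aligned}
\]

  If $b_1' = true$ then $\bullet(e_1 e_2) = \langle e_1' e_2', b_2'\rangle$; moreover we know
  that $e_1$ is nullable.

\[
\begin{aligned}
R(e_1' e_2') &= R(e_1')|e_2| \cup R(e_2') \\
&= (nf(|e_1|) \cup R(e_1))|e_2| \cup nf(|e_2|) \cup R(e_2) \\
&= nf(|e_1|)|e_2| \cup nf(|e_2|) \cup R(e_1)|e_2| \cup R(e_2) \\
&= nf(|e_1 e_2|) \cup R(e_1 e_2)
\end{aligned}
\]

- $e = e_1^*$. Let $\bullet(e_1) = \langle e_1', b_1'\rangle$; then $\bullet(e_1^*) = \langle e_1'^*, true\rangle$;

\[
\begin{aligned}
R(e_1'^*) &= R(e_1')|e_1|^* \\
&= (nf(|e_1|) \cup R(e_1))|e_1|^* \\
&= nf(|e_1|)|e_1|^* \cup R(e_1)|e_1|^* \\
&= nf(|e_1^*|) \cup R(e_1^*)
\end{aligned}
\]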

COROLLARY 47. For every regular expression $e$, $R(\bullet(e)) = nf_\epsilon(e)$
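Corollary 47 can be checked mechanically on small examples. The sketch below is a hedged illustration in Python: expressions are nested tuples (an encoding assumed here), `("pt", a)` stands for the pointed symbol $\bullet a$, and the clauses of `bullet` and `readback` follow the cases used in the proof of Theorem 46. All function names are hypothetical.

```python
EPS = ("eps",)

def nullable(e):                      # nu(e), for plain expressions
    t = e[0]
    if t in ("eps", "star"): return True
    if t in ("empty", "sym"): return False
    if t == "alt": return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])

def nf(e):                            # Definition 41
    t = e[0]
    if t in ("empty", "eps"): return frozenset()
    if t == "sym": return frozenset({e})
    if t == "alt": return nf(e[1]) | nf(e[2])
    if t == "cat":
        left = frozenset(("cat", x, e[2]) for x in nf(e[1]))
        return (left | nf(e[2])) if nullable(e[1]) else left
    return frozenset(("cat", x, e) for x in nf(e[1]))

def nf_eps(e):                        # Definition 43
    return nf(e) | (frozenset({EPS}) if nullable(e) else frozenset())

def carrier(e):                       # |e|: forget all points
    t = e[0]
    if t == "pt": return ("sym", e[1])
    if t in ("alt", "cat"): return (t, carrier(e[1]), carrier(e[2]))
    if t == "star": return ("star", carrier(e[1]))
    return e

def bullet(e):                        # broadcast a point; returns (item, bool)
    t = e[0]
    if t == "empty": return e, False
    if t == "eps": return e, True
    if t in ("sym", "pt"): return ("pt", e[1]), False
    if t == "alt":
        (x, bx), (y, by) = bullet(e[1]), bullet(e[2])
        return ("alt", x, y), bx or by
    if t == "cat":
        x, bx = bullet(e[1])
        if not bx:
            return ("cat", x, e[2]), False
        y, by = bullet(e[2])          # the point passes through a nullable e1
        return ("cat", x, y), by
    x, _ = bullet(e[1])
    return ("star", x), True

def readback(e):                      # R on items
    t = e[0]
    if t in ("empty", "eps", "sym"): return frozenset()
    if t == "pt": return frozenset({("sym", e[1])})
    if t == "alt": return readback(e[1]) | readback(e[2])
    if t == "cat":
        c2 = carrier(e[2])
        return frozenset(("cat", x, c2) for x in readback(e[1])) | readback(e[2])
    c = carrier(e)
    return frozenset(("cat", x, c) for x in readback(e[1]))

def readback_pre(pre):                # R on pres: add epsilon when the flag is set
    e, b = pre
    return readback(e) | (frozenset({EPS}) if b else frozenset())

a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
e = ("star", ("alt", ("cat", a, c), ("cat", b, c)))
print(readback_pre(bullet(e)) == nf_eps(e))
```

In this encoding the two sides coincide as sets of tuples, so no quotient by associativity is needed for the comparison.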

To conclude this section, in analogy with what we did for the
semantic function in Theorem 16, we express the behaviour of $R$
in terms of the lifted algebraic constructors. This will be useful in
Theorem 51.

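Lemma 48.

1. $R(e_1 \oplus e_2) = R(e_1) \cup R(e_2)$

2. $R(\langle e_1', false\rangle \odot e_2) = R(e_1')|e_2| \cup R(e_2)$

3. $R(\langle e_1', true\rangle \odot e_2) = R(e_1')|e_2| \cup nf_\epsilon(|e_2|) \cup R(e_2)$

4. $R(\langle e_1', false\rangle^{\circledast}) = R(e_1')|e_1^*|$

5. $R(\langle e_1', true\rangle^{\circledast}) = R(e_1')|e_1^*| \cup nf_\epsilon(|e_1^*|)$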

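Proof. Let $e_i = \langle e_i', b_i'\rangle$:

1.
\[
\begin{aligned}
R(e_1 \oplus e_2) &= R(\langle e_1', b_1'\rangle \oplus \langle e_2', b_2'\rangle) \\
&= R(\langle e_1' + e_2', b_1' \vee b_2'\rangle) \\
&= R(e_1' + e_2') \cup \epsilon(b_1' \vee b_2') \\
&= R(e_1') \cup R(e_2') \cup \epsilon(b_1') \cup \epsilon(b_2') \\
&= R(e_1') \cup \epsilon(b_1') \cup R(e_2') \cup \epsilon(b_2') \\
&= R(e_1) \cup R(e_2)
\end{aligned}
\]

2.
\[
\begin{aligned}
R(\langle e_1', false\rangle \odot \langle e_2', b_2'\rangle) &= R(\langle e_1' e_2', b_2'\rangle) \\
&= R(e_1')|e_2| \cup R(e_2') \cup \epsilon(b_2') \\
&= R(e_1')|e_2| \cup R(e_2)
\end{aligned}
\]

3. let $\bullet(e_2') = \langle e_2'', b_2''\rangle$; then
\[
\begin{aligned}
R(\langle e_1', true\rangle \odot \langle e_2', b_2'\rangle) &= R(\langle e_1' e_2'', b_2' \vee b_2''\rangle) \\
&= R(e_1')|e_2| \cup R(e_2'') \cup \epsilon(b_2'') \cup \epsilon(b_2') \\
&= R(e_1')|e_2| \cup R(\bullet(e_2')) \cup \epsilon(b_2') \\
&= R(e_1')|e_2| \cup nf_\epsilon(|e_2|) \cup R(e_2') \cup \epsilon(b_2') \\
&= R(e_1')|e_2| \cup nf_\epsilon(|e_2|) \cup R(e_2)
\end{aligned}
\]

4. $R(\langle e_1', false\rangle^{\circledast}) = R(\langle e_1'^*, false\rangle) = R(e_1'^*) = R(e_1')|e_1^*|$

5. let $\bullet(e_1') = \langle e_1'', b_1''\rangle$; then $R(\bullet(e_1')) = R(e_1'') \cup \epsilon(b_1'') =
nf_\epsilon(|e_1|) \cup R(e_1')$, and $R(e_1'') = nf(|e_1|) \cup R(e_1')$.
\[
\begin{aligned}
R(\langle e_1', true\rangle^{\circledast}) &= R(\langle e_1''^*, true\rangle) \\
&= R(e_1'')|e_1^*| \cup \epsilon(true) \\
&= (R(e_1') \cup nf(|e_1|))|e_1^*| \cup \epsilon(true) \\
&= R(e_1')|e_1^*| \cup nf(|e_1|)|e_1^*| \cup \epsilon(true) \\
&= R(e_1')|e_1^*| \cup nf_\epsilon(|e_1^*|)
\end{aligned}
\]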

4.1 Relation with Brzozowski's Derivatives 

We are now ready to formally investigate the relation between 
pointed expressions and Brzozowski's derivatives. As we shall see, 
they give rise to quite different constructions and the relation is less 
obvious than expected. 

Let us start by recalling the formal definition.

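Definition 49.

\[
\begin{aligned}
\partial_a(\emptyset) &= \emptyset \\
\partial_a(\epsilon) &= \emptyset \\
\partial_a(a) &= \epsilon \\
\partial_a(b) &= \emptyset && (b \neq a) \\
\partial_a(e_1 + e_2) &= \partial_a(e_1) + \partial_a(e_2) \\
\partial_a(e_1 e_2) &= \partial_a(e_1)e_2 && \text{if not } \nu(e_1) \\
\partial_a(e_1 e_2) &= \partial_a(e_1)e_2 + \partial_a(e_2) && \text{if } \nu(e_1) \\
\partial_a(e^*) &= \partial_a(e)e^*
\end{aligned}
\]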

Definition 50.

\[
\begin{aligned}
\partial_\epsilon(e) &= e \\
\partial_{aw}(e) &= \partial_w(\partial_a(e))
\end{aligned}
\]
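Definitions 49 and 50 can be transcribed directly. The Python sketch below is an illustration under assumptions made here: expressions are encoded as nested tuples, no ACI simplification is performed, and the names `deriv`, `deriv_word` and `matches` are hypothetical.

```python
EMPTY = ("empty",)
EPS = ("eps",)

def sym(a): return ("sym", a)
def alt(e1, e2): return ("alt", e1, e2)
def cat(e1, e2): return ("cat", e1, e2)
def star(e): return ("star", e)

def nullable(e):
    """nu(e): does e accept the empty string?"""
    tag = e[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "alt":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])      # cat

def deriv(e, a):
    """Brzozowski derivative of e with respect to the symbol a (Definition 49)."""
    tag = e[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if e[1] == a else EMPTY
    if tag == "alt":
        return alt(deriv(e[1], a), deriv(e[2], a))
    if tag == "cat":
        d = cat(deriv(e[1], a), e[2])
        return alt(d, deriv(e[2], a)) if nullable(e[1]) else d
    return cat(deriv(e[1], a), e)                 # star

def deriv_word(e, w):
    """Derivative with respect to a word (Definition 50), symbol by symbol."""
    for a in w:
        e = deriv(e, a)
    return e

def matches(e, w):
    """w is in L(e) iff the derivative of e with respect to w is nullable."""
    return nullable(deriv_word(e, w))

a, b, c = sym("a"), sym("b"), sym("c")
e = star(alt(cat(a, c), cat(b, c)))               # (ac + bc)*
print(deriv(e, "a"))
print(matches(e, "acbc"), matches(e, "ab"))
```

Because nothing is simplified, repeated derivation produces ever larger expressions, which is why a quotient such as ACI is needed to obtain finitely many states.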

In general, given a regular expression $e$ over the alphabet $\Sigma$, the
set $\{\partial_w(e) \mid w \in \Sigma^*\}$ of all its derivatives is not finite. In order
to get a finite set we must suitably quotient derivatives according
to algebraic equalities between regular expressions. The choice of
different sets of equations gives rise to different quotients, and hence
to different automata. Since for finiteness it is enough to consider
associativity, commutativity and idempotence of the sum (ACI), the
traditional theory of Brzozowski's derivatives is defined according
to these laws (although this is probably not the best choice from a
practical point of view).

As a practical example, in Figure 3 we describe the automaton
obtained using derivatives relative to the expression $(ac + bc)^*$
(compare it with the automaton of Figure 2). Also, note that the
vertically aligned states are equivalent.

Let us remark, first of all, the heavy use of ACI. For instance

\[ \partial_a((ac + bc)^*) = (\epsilon c + \emptyset c)(ac + bc)^* \]

while

\[ \partial_b((ac + bc)^*) = (\emptyset c + \epsilon c)(ac + bc)^* \]




Figure 3. Automaton with Brzozowski's derivatives 



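and they can be assimilated only up to commutativity of the sum.
As another example,

\[
\begin{aligned}
&\partial_a((\emptyset c + \emptyset)(ac + bc)^* + (\emptyset c + \epsilon)(ac + bc)^*) \\
&\quad = (\emptyset c + \emptyset)(ac + bc)^* + ((\emptyset c + \emptyset)(ac + bc)^* + (\epsilon c + \emptyset c)(ac + bc)^*)
\end{aligned}
\]

and the latter expression can be reduced to

\[ (\emptyset c + \emptyset)(ac + bc)^* + (\epsilon c + \emptyset c)(ac + bc)^* \]

only using associativity and idempotence of the sum.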

The second important remark is that, in general, it is not true
that we may obtain the pre automaton by quotienting the derivative
one (nor the other way round). For instance, from the initial state,
the two arcs labelled $a$ and $b$ lead to a single state in the automaton
of Figure 3, but to different states in the automaton of Figure 2.

A natural question is hence to understand if there exists a com- 
mon algebraic quotient between the two constructions (not exploit- 
ing minimization). 

As we shall see, this can be achieved by identifying states with
the same readback in the case of pres, and states with similar
look-ahead normal form in the case of derivatives.

For instance, in the case of the two automata of Figures 2 and 3
we would obtain the common quotient of Figure 4.




Figure 4. A quotient of the two automata



The general picture is described by the commuting diagram of
Figure 5, whose proof will be the object of the next section (in
Figure 5, $w$ obviously stands for the string $a_1 \ldots a_n$).

4.2 Formal proof of the commuting diagram in Figure 5

Part of the diagram has been already proved: the leftmost triangle,
used to relate the initial state of the two automata, is Corollary 47;
the two triangles at the right, used to relate the final states, just
state the trivial properties that $\epsilon \in R(\langle e, b\rangle)$ if and only if
$b = true$ (since no expression in $R(e)$ is nullable), and $\epsilon \in nf_\epsilon(e)$
if and only if $e$ is nullable (see Remark 42).

We start by proving the upper part. We prove it for a pointed item $e$
and leave the obvious generalization to a pointed expression to the
reader (the move operation does not depend on the presence of a
trailing point, and similarly the derivative of $\epsilon$ is empty).

THEOREM 51. For any pointed item $e$,

\[ R(move(e, a)) = nf_\epsilon(\partial_a(R(e))) \]

Proof. By induction on the structure of e: 






[Commuting diagram; arrow labels recoverable from the source: $move^*(\_, w)$, $R$, $\partial_a$, $nf_\epsilon$, $snd$.]

Figure 5. Pointed regular expressions and Brzozowski's derivatives



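- the cases $\emptyset$, $\epsilon$, $a$ and $b$ are trivial

- if $e = \bullet a$ then $move(\bullet a, a) = \langle a, true\rangle$ and $R(\langle a, true\rangle) = \{\epsilon\}$.
  On the other side,
  $nf_\epsilon(\partial_a(R(\bullet a))) = nf_\epsilon(\partial_a(\{a\})) = nf_\epsilon(\{\epsilon\}) = \{\epsilon\}$

- if $e = e_1 + e_2$, then

\[
\begin{aligned}
R(move(e_1 + e_2, a)) &= R(move(e_1, a) \oplus move(e_2, a)) \\
&= R(move(e_1, a)) \cup R(move(e_2, a)) \\
&= nf_\epsilon(\partial_a(R(e_1))) \cup nf_\epsilon(\partial_a(R(e_2))) \\
&= nf_\epsilon(\partial_a(R(e_1 + e_2)))
\end{aligned}
\]

- let $e = e_1 e_2$, and let us suppose that $move(e_1, a) = \langle e_1', false\rangle$
  and thus $R(move(e_1, a)) = R(e_1')$ and $\nu(\partial_a(R(e_1))) = false$.
  Then

\[
\begin{aligned}
R(move(e_1 e_2, a)) &= R(move(e_1, a) \odot move(e_2, a)) \\
&= R(move(e_1, a))|move(e_2, a)| \cup R(move(e_2, a)) \\
&= nf_\epsilon(\partial_a(R(e_1)))|e_2| \cup nf_\epsilon(\partial_a(R(e_2))) \\
&= nf_\epsilon(\partial_a(R(e_1))|e_2| \cup \partial_a(R(e_2))) \\
&= nf_\epsilon(\partial_a(R(e_1)|e_2|) \cup \partial_a(R(e_2))) \\
&= nf_\epsilon(\partial_a(R(e_1)|e_2| \cup R(e_2))) \\
&= nf_\epsilon(\partial_a(R(e_1 e_2)))
\end{aligned}
\]

  If $move(e_1, a) = \langle e_1', true\rangle$ then
  $R(move(e_1, a)) = R(e_1') \cup \{\epsilon\} = nf_\epsilon(\partial_a(R(e_1)))$. In particular
  $R(e_1') = nf(\partial_a(R(e_1)))$ and $\nu(\partial_a(R(e_1))) = true$. We have then:

\[
\begin{aligned}
R(move(e_1 e_2, a)) &= R(move(e_1, a) \odot move(e_2, a)) \\
&= R(e_1')|move(e_2, a)| \cup nf_\epsilon(|move(e_2, a)|) \cup R(move(e_2, a)) \\
&= R(e_1')|e_2| \cup nf_\epsilon(|e_2|) \cup R(move(e_2, a)) \\
&= nf(\partial_a(R(e_1)))|e_2| \cup nf_\epsilon(|e_2|) \cup nf_\epsilon(\partial_a(R(e_2))) \\
&= nf_\epsilon(\partial_a(R(e_1))|e_2|) \cup nf_\epsilon(\partial_a(R(e_2))) \\
&= nf_\epsilon(\partial_a(R(e_1))|e_2| \cup \partial_a(R(e_2))) \\
&= nf_\epsilon(\partial_a(R(e_1)|e_2|) \cup \partial_a(R(e_2))) \\
&= nf_\epsilon(\partial_a(R(e_1)|e_2| \cup R(e_2))) \\
&= nf_\epsilon(\partial_a(R(e_1 e_2)))
\end{aligned}
\]

- let $e = e_1^*$, and let us suppose that $move(e_1, a) = \langle e_1', false\rangle$.
  Thus $\epsilon \notin nf_\epsilon(\partial_a(R(e_1)))$. Then

\[
\begin{aligned}
R(move(e_1^*, a)) &= R(move(e_1, a)^{\circledast}) \\
&= R(e_1')|e_1^*| \\
&= nf_\epsilon(\partial_a(R(e_1)))|e_1^*| \\
&= nf_\epsilon(\partial_a(R(e_1))|e_1^*|) \\
&= nf_\epsilon(\partial_a(R(e_1)|e_1^*|)) \\
&= nf_\epsilon(\partial_a(R(e_1^*)))
\end{aligned}
\]

  If $move(e_1, a) = \langle e_1', true\rangle$ then
  $R(move(e_1, a)) = R(e_1') \cup \{\epsilon\} = nf_\epsilon(\partial_a(R(e_1)))$. In particular
  $R(e_1') = nf(\partial_a(R(e_1)))$ and $\nu(\partial_a(R(e_1))) = true$ since
  $\epsilon \in nf_\epsilon(\partial_a(R(e_1)))$. We have then:

\[
\begin{aligned}
R(move(e_1^*, a)) &= R(move(e_1, a)^{\circledast}) \\
&= R(e_1')|e_1^*| \cup nf_\epsilon(|e_1^*|) \\
&= nf(\partial_a(R(e_1)))|e_1^*| \cup nf_\epsilon(|e_1^*|) \\
&= nf_\epsilon(\partial_a(R(e_1))|e_1^*|) \\
&= nf_\epsilon(\partial_a(R(e_1)|e_1^*|)) \\
&= nf_\epsilon(\partial_a(R(e_1^*)))
\end{aligned}
\]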

We now pass to proving the lower part of the diagram in Figure 5,
namely that for any regular expression $e$,

\[ nf_\epsilon(\partial_a(e)) = nf_\epsilon(\partial_a(nf_\epsilon(e))) \]

Since, however, $nf_\epsilon(\partial_a(nf_\epsilon(e))) = nf_\epsilon(\partial_a(nf(e)))$ (the derivative
of $\epsilon$ is empty), this is equivalent to proving the following result.

Theorem 52. $nf_\epsilon(\partial_a(e)) = nf_\epsilon(\partial_a(nf(e)))$

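Proof. The proof is by induction on $e$. Any induction hypothesis
over a regular expression $e_1$ can be strengthened to
$nf_\epsilon(\partial_a(e_1)e_2) = nf_\epsilon(\partial_a(nf(e_1))e_2)$ for all $e_2$ since

\[
\begin{aligned}
nf_\epsilon(\partial_a(e_1)e_2)
&= nf_\epsilon(\partial_a(e_1))e_2 \cup (nf_\epsilon(e_2) \text{ if } \nu(\partial_a(e_1))) \\
&= nf_\epsilon(\partial_a(nf(e_1)))e_2 \cup (nf_\epsilon(e_2) \text{ if } \nu(\partial_a(nf(e_1)))) \\
&= nf_\epsilon(\partial_a(nf(e_1))e_2)
\end{aligned}
\]

(observe that $\nu(\partial_a(e_1)) = \nu(\partial_a(nf(e_1)))$ since the languages
denoted by $\partial_a(e_1)$ and $\partial_a(nf(e_1))$ are equal).
We must consider the following cases.

- If $e$ is $\emptyset$, $\epsilon$, or a symbol $b$ different from $a$ then both sides of the
  equation are empty

- If $e$ is $a$, $nf_\epsilon(\partial_a(a)) = nf_\epsilon(\epsilon) = \{\epsilon\} = nf_\epsilon(\partial_a(\{a\})) =
  nf_\epsilon(\partial_a(nf(a)))$

- If $e$ is $e_1 + e_2$,

\[
\begin{aligned}
nf_\epsilon(\partial_a(e_1 + e_2)) &= nf_\epsilon(\partial_a(e_1) + \partial_a(e_2)) \\
&= nf_\epsilon(\partial_a(e_1)) \cup nf_\epsilon(\partial_a(e_2)) \\
&= nf_\epsilon(\partial_a(nf(e_1))) \cup nf_\epsilon(\partial_a(nf(e_2))) \\
&= nf_\epsilon(\partial_a(nf(e_1) \cup nf(e_2))) \\
&= nf_\epsilon(\partial_a(nf(e_1 + e_2)))
\end{aligned}
\]

- If $e$ is $e_1 e_2$ and $\nu(e_1) = false$,

\[
\begin{aligned}
nf_\epsilon(\partial_a(e_1 e_2)) &= nf_\epsilon(\partial_a(e_1)e_2) = nf_\epsilon(\partial_a(nf(e_1))e_2) \\
&= nf_\epsilon(\partial_a(nf(e_1)e_2)) = nf_\epsilon(\partial_a(nf(e_1 e_2)))
\end{aligned}
\]

- If $e$ is $e_1 e_2$ and $\nu(e_1) = true$,

\[
\begin{aligned}
nf_\epsilon(\partial_a(e_1 e_2)) &= nf_\epsilon(\partial_a(e_1)e_2) \cup nf_\epsilon(\partial_a(e_2)) \\
&= nf_\epsilon(\partial_a(nf(e_1))e_2) \cup nf_\epsilon(\partial_a(nf(e_2))) \\
&= nf_\epsilon(\partial_a(nf(e_1)e_2 \cup nf(e_2))) \\
&= nf_\epsilon(\partial_a(nf(e_1 e_2)))
\end{aligned}
\]

- If $e$ is $e_1^*$,

\[
\begin{aligned}
nf_\epsilon(\partial_a(e_1^*)) &= nf_\epsilon(\partial_a(e_1)e_1^*) = nf_\epsilon(\partial_a(nf(e_1))e_1^*) \\
&= nf_\epsilon(\partial_a(nf(e_1)e_1^*)) = nf_\epsilon(\partial_a(nf(e_1^*)))
\end{aligned}
\]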

Lemma 53. $R(e) = nf_\epsilon(R(e))$

Proof. We proceed by induction over $e$:

- $R(\emptyset) = \emptyset = nf_\epsilon(\emptyset) = nf_\epsilon(R(\emptyset))$

- $R(\epsilon) = \emptyset = nf_\epsilon(\emptyset) = nf_\epsilon(R(\epsilon))$

- $R(a) = \emptyset = nf_\epsilon(\emptyset) = nf_\epsilon(R(a))$

- $R(\bullet a) = \{a\} = nf_\epsilon(\{a\}) = nf_\epsilon(R(\bullet a))$

- $R(e_1 + e_2) = R(e_1) \cup R(e_2) = nf_\epsilon(R(e_1)) \cup nf_\epsilon(R(e_2)) =
  nf_\epsilon(R(e_1) \cup R(e_2)) = nf_\epsilon(R(e_1 + e_2))$

- $R(e_1 e_2) = R(e_1)|e_2| \cup R(e_2) = nf_\epsilon(R(e_1))|e_2| \cup nf_\epsilon(R(e_2)) =
  nf_\epsilon(R(e_1)|e_2|) \cup nf_\epsilon(R(e_2)) = nf_\epsilon(R(e_1)|e_2| \cup R(e_2)) =
  nf_\epsilon(R(e_1 e_2))$

- $R(e^*) = R(e)|e|^* = nf_\epsilon(R(e))|e|^* = nf_\epsilon(R(e)|e|^*) =
  nf_\epsilon(R(e^*))$

We are now ready to prove the commutation of the outermost 
diagram. 

THEOREM 54. For any pointed item $e$,

\[ R(move^*(e, w)) = nf_\epsilon(\partial_w(R(e))) \]

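Proof. The proof is by induction on the structure of $w$. In the base
case, $R(move^*(e, \epsilon)) = R(e) = nf_\epsilon(R(e)) = nf_\epsilon(\partial_\epsilon(R(e)))$. In
the inductive step, by Theorem 52,

\[
\begin{aligned}
R(move^*(e, aw)) &= R(move^*(move(e, a), w)) \\
&= nf_\epsilon(\partial_w(R(move(e, a)))) \\
&= nf_\epsilon(\partial_w(nf_\epsilon(\partial_a(R(e))))) \\
&= nf_\epsilon(\partial_w(\partial_a(R(e)))) \\
&= nf_\epsilon(\partial_{aw}(R(e)))
\end{aligned}
\]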

COROLLARY 55. For any regular expression $e$,

\[ R(move^*(\bullet e, w)) = nf_\epsilon(\partial_w(e)) \]

Proof.

\[ R(move^*(\bullet e, w)) = nf_\epsilon(\partial_w(R(\bullet e))) = nf_\epsilon(\partial_w(nf_\epsilon(e))) = nf_\epsilon(\partial_w(e)) \]

Another important consequence of Theorems 51 and 52 is that $R$
and $nf_\epsilon$ are admissible relations (respectively, over pres and over
derivatives).

THEOREM 56. $kn(R(-))$ (the kernel of $R(-)$) is an admissible
equivalence relation over pres.



Proof. By Lemma 39 we derive that for all pres $e_1, e_2$, if
$R(e_1) = R(e_2)$ then $L_p(e_1) = L_p(e_2)$. We also need to prove that
for all pres $e_1, e_2$ and all symbols $a$, if $R(e_1) = R(e_2)$ then
$R(move(e_1, a)) = R(move(e_2, a))$. By Theorem 51,

\[ R(move(e_1, a)) = nf_\epsilon(\partial_a(R(e_1))) = nf_\epsilon(\partial_a(R(e_2))) = R(move(e_2, a)) \]

THEOREM 57. $kn(nf_\epsilon(-))$ is an admissible equivalence relation
over regular expressions.

Proof. By Theorem 45 we derive that for all regular expressions
$e_1, e_2$, if $nf_\epsilon(e_1) = nf_\epsilon(e_2)$ then $L(e_1) = L(e_2)$. We also need
to prove that for all regular expressions $e_1, e_2$ and all symbols $a$, if
$nf_\epsilon(e_1) = nf_\epsilon(e_2)$ then $nf_\epsilon(\partial_a(e_1)) = nf_\epsilon(\partial_a(e_2))$.
By Theorem 52,

\[ nf_\epsilon(\partial_a(e_1)) = nf_\epsilon(\partial_a(nf_\epsilon(e_1))) = nf_\epsilon(\partial_a(nf_\epsilon(e_2))) = nf_\epsilon(\partial_a(e_2)) \]

Theorem 58. For each regular expression $e$, let
$D^\bullet_e = (Q^\bullet, \Sigma, \bullet e, t^\bullet, F^\bullet)$ be the automaton for $e$ built
according to Definition 30 and let
$D^\partial_e = (Q^\partial, \Sigma, e, t^\partial, F^\partial)$ be the automaton for $e$ obtained with
derivatives. Let $kn(R)$ and $kn(nf_\epsilon)$ be the kernels of $R$ and $nf_\epsilon$
respectively. Then $D^\bullet_e / kn(R) = D^\partial_e / kn(nf_\epsilon)$.

Proof. The result holds by commutation of Figure 5, which is granted
by the previous results, in particular by Corollary 55, Theorem 56,
Theorem 57 and the commutation of the triangles relative to the
initial and final states.

Theorem 58 relates our finite automata with the infinite-state
ones obtained via Brzozowski's derivatives before quotienting the
automata states by means of ACI to make them finite. The following
easy lemma shows that $kn(nf_\epsilon)$ is an equivalence relation coarser
than ACI (ACI-equal expressions have the same normal form), and thus
Theorem 58 also holds for the standard finite Brzozowski automata,
since we can quotient with ACI first.

LEMMA 59. Let $e_1$ and $e_2$ be regular expressions. If $e_1 =_{ACI} e_2$
then $nf_\epsilon(e_1) = nf_\epsilon(e_2)$.

5. Merging 



By Theorem 16, $L_p(\bullet e) = L_p(e) \cup L(|e|)$. A more syntactic way
to look at this result is to observe that $\bullet(e)$ can be obtained by
"merging" together the points in $e$ and $\bullet(|e|)$, and that the language
defined by merging two pointed expressions $e_1$ and $e_2$ is just the
union of the two languages $L_p(e_1)$ and $L_p(e_2)$. The merging
operation, which we shall denote by $\dagger$, also provides the relation
between deterministic and nondeterministic automata where, as in
Watson [10, 11], we may label states with expressions with a single
point (for lack of space, we shall not explicitly address the latter
issue in this paper; it is however a simple consequence of
Theorem 67). Finally, the merging operation will allow us to explain
why the technique of pointed expressions cannot be (naively)
generalized to intersection and complement (see Section 5.1).

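DEFINITION 60. Let $e_1$ and $e_2$ be two items on the same carrier
$|e|$. The merge of $e_1$ and $e_2$ is defined by the following rules by
recursion over the structure of $e$:

\[
\begin{aligned}
\emptyset \dagger \emptyset &= \emptyset \\
\epsilon \dagger \epsilon &= \epsilon \\
a \dagger a &= a \\
\bullet a \dagger a &= \bullet a \\
a \dagger \bullet a &= \bullet a \\
\bullet a \dagger \bullet a &= \bullet a \\
(e_1^1 + e_2^1) \dagger (e_1^2 + e_2^2) &= (e_1^1 \dagger e_1^2) + (e_2^1 \dagger e_2^2) \\
(e_1^1 e_2^1) \dagger (e_1^2 e_2^2) &= (e_1^1 \dagger e_1^2)(e_2^1 \dagger e_2^2) \\
e_1^* \dagger e_2^* &= (e_1 \dagger e_2)^*
\end{aligned}
\]

The definition is extended to pres as follows:

\[ \langle e_1, b_1\rangle \dagger \langle e_2, b_2\rangle = \langle e_1 \dagger e_2, b_1 \vee b_2\rangle \]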
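The merge operation is a pointwise union of marks over a shared carrier, which a short sketch makes concrete. The Python code below is an illustration under assumptions made here: items are nested tuples, `("pt", a)` stands for $\bullet a$, a pre is a pair `(item, bool)`, and the names `merge` and `merge_pre` are hypothetical.

```python
def merge(e1, e2):
    """Merge two items on the same carrier: a symbol is pointed in the
    result iff it is pointed in at least one argument."""
    t1, t2 = e1[0], e2[0]
    if t1 in ("sym", "pt") and t2 in ("sym", "pt"):
        if e1[1] != e2[1]:
            raise ValueError("items have different carriers")
        return ("pt", e1[1]) if "pt" in (t1, t2) else e1
    if t1 != t2:
        raise ValueError("items have different carriers")
    if t1 in ("empty", "eps"):
        return e1
    if t1 == "star":
        return ("star", merge(e1[1], e2[1]))
    # alt or cat: merge componentwise
    return (t1, merge(e1[1], e2[1]), merge(e1[2], e2[2]))

def merge_pre(p1, p2):
    """Extension to pres: merge the items, take the disjunction of the flags."""
    (e1, b1), (e2, b2) = p1, p2
    return merge(e1, e2), b1 or b2

a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
pa, pb = ("pt", "a"), ("pt", "b")
e1 = ("alt", ("cat", pa, c), ("cat", b, c))    # (.a c + b c)
e2 = ("alt", ("cat", a, c), ("cat", pb, c))    # (a c + .b c)
print(merge(e1, e2))                           # both a and b pointed
```

Commutativity, associativity and idempotence are inherited from Boolean disjunction on each symbol position, so the structural check only has to guard against mismatched carriers.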

THEOREM 61. $\dagger$ is commutative, associative and idempotent.

Proof. Trivial by induction over the structure of the carrier of the 
arguments. 

Theorem 62. $L_p(e_1 \dagger e_2) = L_p(e_1) \cup L_p(e_2)$

Proof. Trivial by induction on the common carrier of the items of
$e_1$ and $e_2$.

All the constructions we presented so far commute with the 
merge operation. Since merging essentially corresponds to the sub- 
set construction over automata, the following theorems constitute 
the proof of correctness of the subset construction. 
Theorem 63. $(e_1^1 \dagger e_1^2) \oplus (e_2^1 \dagger e_2^2) = (e_1^1 \oplus e_2^1) \dagger (e_1^2 \oplus e_2^2)$

Proof. Trivial by expansion of definitions.

Theorem 64.

1. for $e_1$ and $e_2$ items on the same carrier,
   $\bullet(e_1 \dagger e_2) = \bullet(e_1) \dagger \langle e_2, false\rangle$

2. for $e_1$ and $e_2$ pres on the same carrier,
   $\bullet(e_1 \dagger e_2) = \bullet(e_1) \dagger e_2$

3. $(e_1^1 \dagger e_1^2) \odot (e_2^1 \dagger e_2^2) = (e_1^1 \odot e_2^1) \dagger (e_1^2 \odot e_2^2)$

Corollary 65.

\[ \bullet(e_1 \dagger e_2) = e_1 \dagger \bullet(e_2) = \bullet(e_1) \dagger \bullet(e_2) \]

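Proof [of the corollary]. The corollary is a simple consequence of
commutativity of $\dagger$ and idempotence of $\bullet(\cdot)$:

\[
\begin{aligned}
\bullet(e_1 \dagger e_2) &= \bullet(e_2 \dagger e_1) = \bullet(e_2) \dagger e_1 = e_1 \dagger \bullet(e_2) \\
\bullet(e_1 \dagger e_2) &= \bullet(\bullet(e_1 \dagger e_2)) = \bullet(\bullet(e_1) \dagger e_2) = \bullet(e_1) \dagger \bullet(e_2)
\end{aligned}
\]

Proof [of 1.]. We first prove $\bullet(e_1 \dagger e_2) = \bullet(e_1) \dagger \langle e_2, false\rangle$ by
induction over the structure of the common carrier of $e_1$ and $e_2$,
assuming that 3. holds on terms whose carrier is structurally smaller
than $e$.

- If $|e_1|$ is $\emptyset$, $\epsilon$, $a$, $\bullet a$ then trivial

- If $e_1$ is $e_1^1 + e_2^1$ and $e_2$ is $e_1^2 + e_2^2$:

\[
\begin{aligned}
\bullet((e_1^1 + e_2^1) \dagger (e_1^2 + e_2^2))
&= \bullet((e_1^1 \dagger e_1^2) + (e_2^1 \dagger e_2^2)) \\
&= \bullet(e_1^1 \dagger e_1^2) \oplus \bullet(e_2^1 \dagger e_2^2) \\
&= (\bullet(e_1^1) \dagger \langle e_1^2, false\rangle) \oplus (\bullet(e_2^1) \dagger \langle e_2^2, false\rangle) \\
&= (\bullet(e_1^1) \oplus \bullet(e_2^1)) \dagger (\langle e_1^2, false\rangle \oplus \langle e_2^2, false\rangle) \\
&= \bullet(e_1^1 + e_2^1) \dagger \langle e_1^2 + e_2^2, false\rangle
\end{aligned}
\]

- If $e_1$ is $e_1^1 e_2^1$ and $e_2$ is $e_1^2 e_2^2$ then, using 3. on items whose
  carrier is structurally smaller than $|e_1|$,

\[
\begin{aligned}
\bullet((e_1^1 e_2^1) \dagger (e_1^2 e_2^2))
&= \bullet((e_1^1 \dagger e_1^2)(e_2^1 \dagger e_2^2)) \\
&= \bullet(e_1^1 \dagger e_1^2) \odot \langle e_2^1 \dagger e_2^2, false\rangle \\
&= (\bullet(e_1^1) \dagger \langle e_1^2, false\rangle) \odot (\langle e_2^1, false\rangle \dagger \langle e_2^2, false\rangle) \\
&= (\bullet(e_1^1) \odot \langle e_2^1, false\rangle) \dagger (\langle e_1^2, false\rangle \odot \langle e_2^2, false\rangle) \\
&= \bullet(e_1^1 e_2^1) \dagger \langle e_1^2 e_2^2, false\rangle
\end{aligned}
\]

- If $e_1$ is ${e_1^1}^*$ and $e_2$ is ${e_1^2}^*$, let $\bullet(e_1^1 \dagger e_1^2) = \langle e', b'\rangle$ and
  $\bullet(e_1^1) = \langle e'', b''\rangle$. By induction hypothesis,
  $\langle e', b'\rangle = \bullet(e_1^1 \dagger e_1^2) = \bullet(e_1^1) \dagger \langle e_1^2, false\rangle = \langle e'', b''\rangle \dagger \langle e_1^2, false\rangle$.
  Then

\[
\begin{aligned}
\bullet({e_1^1}^* \dagger {e_1^2}^*) &= \bullet((e_1^1 \dagger e_1^2)^*) = \langle e'^*, true\rangle \\
&= \langle e''^*, true\rangle \dagger \langle {e_1^2}^*, false\rangle = \bullet({e_1^1}^*) \dagger \langle {e_1^2}^*, false\rangle
\end{aligned}
\]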

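Proof [of 3.]. Let $\langle {e'}_i^j, b_i^j\rangle = e_i^j$. By definition of $\dagger$, we have

\[ e_1^1 \dagger e_1^2 = \langle {e'}_1^1 \dagger {e'}_1^2, b_1^1 \vee b_1^2\rangle \]

For all $b$ and $e$, let

\[
\bullet_b(e) :=
\begin{cases}
\langle e, false\rangle & \text{if } b = false \\
\bullet(e) & \text{otherwise}
\end{cases}
\]

Thus for all $e_1', e_2', b_1', b_2'$, letting $\langle e_2'', b_2''\rangle := \bullet_{b_1'}(e_2')$, the
following holds:

\[ \langle e_1', b_1'\rangle \odot \langle e_2', b_2'\rangle = \langle e_1' e_2'', b_2' \vee b_2''\rangle \]

Let $\langle {e''}_2^j, {b''}_2^j\rangle := \bullet_{b_1^j}({e'}_2^j)$. By property 1. we have:

\[ \langle {e''}_2^1 \dagger {e''}_2^2, {b''}_2^1 \vee {b''}_2^2\rangle = \bullet_{b_1^1}({e'}_2^1) \dagger \bullet_{b_1^2}({e'}_2^2) = \bullet_{b_1^1 \vee b_1^2}({e'}_2^1 \dagger {e'}_2^2) \]

Thus

\[
\begin{aligned}
(e_1^1 \dagger e_1^2) \odot (e_2^1 \dagger e_2^2)
&= \langle {e'}_1^1 \dagger {e'}_1^2, b_1^1 \vee b_1^2\rangle \odot (e_2^1 \dagger e_2^2) \\
&= \langle ({e'}_1^1 \dagger {e'}_1^2)({e''}_2^1 \dagger {e''}_2^2), b_2^1 \vee b_2^2 \vee {b''}_2^1 \vee {b''}_2^2\rangle \\
&= \langle ({e'}_1^1 {e''}_2^1) \dagger ({e'}_1^2 {e''}_2^2), (b_2^1 \vee {b''}_2^1) \vee (b_2^2 \vee {b''}_2^2)\rangle \\
&= \langle {e'}_1^1 {e''}_2^1, b_2^1 \vee {b''}_2^1\rangle \dagger \langle {e'}_1^2 {e''}_2^2, b_2^2 \vee {b''}_2^2\rangle \\
&= (e_1^1 \odot e_2^1) \dagger (e_1^2 \odot e_2^2)
\end{aligned}
\]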

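Theorem 66. $(e_1 \dagger e_2)^{\circledast} = e_1^{\circledast} \dagger e_2^{\circledast}$

Proof. Let $e_1 = \langle e^1, b_1\rangle$ and $e_2 = \langle e^2, b_2\rangle$. Thus

\[ (\langle e^1, b_1\rangle \dagger \langle e^2, b_2\rangle)^{\circledast} = \langle e^1 \dagger e^2, b_1 \vee b_2\rangle^{\circledast} \]

Let us define $e'$, $e_1'$ and $e_2'$ by cases on $b_1$ and $b_2$ with the property
that $e' = e_1' \dagger e_2'$:

- If $b_1 = b_2 = false$ then let $e_i' = e^i$ and $e' = e^1 \dagger e^2$. Obviously
  $e' = e_1' \dagger e_2'$.

- If $b_1 = true$ and $b_2 = false$ then let $\bullet(e^1) = \langle e_1', b_1'\rangle$, let
  $e_2' = e^2$ and let $\bullet(e^1 \dagger e^2) = \bullet(e^1) \dagger \langle e^2, false\rangle = \langle e', b'\rangle$.
  Hence $e_1' \dagger e_2' = e_1' \dagger e^2 = e'$.

- The case $b_1 = false$ and $b_2 = true$ is handled dually to the
  previous one.

- If $b_1 = true$ and $b_2 = true$ then let $\bullet(e^i) = \langle e_i', b_i'\rangle$ and let
  $\bullet(e^1 \dagger e^2) = \bullet(e^1) \dagger \bullet(e^2) = \langle e', b'\rangle$. Hence $e_1' \dagger e_2' = e'$.

In all cases,

\[
\begin{aligned}
\langle e^1 \dagger e^2, b_1 \vee b_2\rangle^{\circledast} &= \langle e'^*, b_1 \vee b_2\rangle = \langle (e_1' \dagger e_2')^*, b_1 \vee b_2\rangle \\
&= \langle {e_1'}^* \dagger {e_2'}^*, b_1 \vee b_2\rangle = \langle {e_1'}^*, b_1\rangle \dagger \langle {e_2'}^*, b_2\rangle \\
&= \langle e^1, b_1\rangle^{\circledast} \dagger \langle e^2, b_2\rangle^{\circledast}
\end{aligned}
\]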

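THEOREM 67. $move(e_1 \dagger e_2, a) = move(e_1, a) \dagger move(e_2, a)$

Proof. The proof is by induction on the structure of $e$.

- the cases $\emptyset$, $\epsilon$ and $b \neq a$ are trivial by computation

- the case $a$ has four sub-cases: if $e_1$ and $e_2$ are both $a$, then
  $move(a \dagger a, a) = \langle a, false\rangle = move(a, a) \dagger move(a, a)$;
  otherwise at least one of $e_1$ or $e_2$ is $\bullet a$ and
  $move(e_1 \dagger e_2, a) = move(\bullet a, a) = \langle a, true\rangle = move(e_1, a) \dagger move(e_2, a)$

- if $e$ is $e_1 + e_2$ then

\[
\begin{aligned}
move((e_1^1 + e_2^1) \dagger (e_1^2 + e_2^2), a)
&= move((e_1^1 \dagger e_1^2) + (e_2^1 \dagger e_2^2), a) \\
&= move(e_1^1 \dagger e_1^2, a) \oplus move(e_2^1 \dagger e_2^2, a) \\
&= (move(e_1^1, a) \dagger move(e_1^2, a)) \oplus (move(e_2^1, a) \dagger move(e_2^2, a)) \\
&= (move(e_1^1, a) \oplus move(e_2^1, a)) \dagger (move(e_1^2, a) \oplus move(e_2^2, a)) \\
&= move(e_1^1 + e_2^1, a) \dagger move(e_1^2 + e_2^2, a)
\end{aligned}
\]

- if $e$ is $e_1 e_2$ then

\[
\begin{aligned}
move((e_1^1 e_2^1) \dagger (e_1^2 e_2^2), a)
&= move((e_1^1 \dagger e_1^2)(e_2^1 \dagger e_2^2), a) \\
&= move(e_1^1 \dagger e_1^2, a) \odot move(e_2^1 \dagger e_2^2, a) \\
&= (move(e_1^1, a) \dagger move(e_1^2, a)) \odot (move(e_2^1, a) \dagger move(e_2^2, a)) \\
&= (move(e_1^1, a) \odot move(e_2^1, a)) \dagger (move(e_1^2, a) \odot move(e_2^2, a)) \\
&= move(e_1^1 e_2^1, a) \dagger move(e_1^2 e_2^2, a)
\end{aligned}
\]

- if $e$ is $e_1^*$ then

\[
\begin{aligned}
move({e_1^1}^* \dagger {e_1^2}^*, a) &= move((e_1^1 \dagger e_1^2)^*, a) \\
&= move(e_1^1 \dagger e_1^2, a)^{\circledast} = (move(e_1^1, a) \dagger move(e_1^2, a))^{\circledast} \\
&= move(e_1^1, a)^{\circledast} \dagger move(e_1^2, a)^{\circledast} = move({e_1^1}^*, a) \dagger move({e_1^2}^*, a)
\end{aligned}
\]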
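Theorem 67 says that moving a merged state is the same as merging the moved states, which is the essence of the subset construction. The Python sketch below checks this on a few items. It is an illustration under assumptions made here: items are nested tuples with `("pt", a)` for $\bullet a$, pres are pairs `(item, bool)`, the lifted operators follow the clauses used in the proof of Lemma 48, and every function name is hypothetical.

```python
def bullet(e):                          # broadcast a point; returns a pre
    t = e[0]
    if t == "empty": return e, False
    if t == "eps": return e, True
    if t in ("sym", "pt"): return ("pt", e[1]), False
    if t == "alt":
        (x, bx), (y, by) = bullet(e[1]), bullet(e[2])
        return ("alt", x, y), bx or by
    if t == "cat":
        x, bx = bullet(e[1])
        if not bx: return ("cat", x, e[2]), False
        y, by = bullet(e[2])
        return ("cat", x, y), by
    x, _ = bullet(e[1])
    return ("star", x), True

def oplus(p, q):                        # lifted sum
    (x, bx), (y, by) = p, q
    return ("alt", x, y), bx or by

def odot(p, q):                         # lifted concatenation
    (x, bx), (y, by) = p, q
    if not bx:
        return ("cat", x, y), by
    y2, by2 = bullet(y)                 # a trailing point of x enters y
    return ("cat", x, y2), by or by2

def ostar(p):                           # lifted star
    x, bx = p
    if not bx:
        return ("star", x), False
    x2, _ = bullet(x)                   # a trailing point re-enters the loop
    return ("star", x2), True

def move(e, a):
    t = e[0]
    if t in ("empty", "eps", "sym"): return e, False
    if t == "pt": return ("sym", e[1]), e[1] == a
    if t == "alt": return oplus(move(e[1], a), move(e[2], a))
    if t == "cat": return odot(move(e[1], a), move(e[2], a))
    return ostar(move(e[1], a))

def merge(e1, e2):                      # Definition 60, on items
    t1, t2 = e1[0], e2[0]
    if t1 in ("sym", "pt") and t2 in ("sym", "pt"):
        return ("pt", e1[1]) if "pt" in (t1, t2) else e1
    if t1 in ("empty", "eps"): return e1
    if t1 == "star": return ("star", merge(e1[1], e2[1]))
    return (t1, merge(e1[1], e2[1]), merge(e1[2], e2[2]))

def merge_pre(p, q):                    # Definition 60, on pres
    (x, bx), (y, by) = p, q
    return merge(x, y), bx or by

a, b = ("sym", "a"), ("sym", "b")
e1, e2 = ("cat", ("pt", "a"), b), ("cat", a, ("pt", "b"))
print(move(merge(e1, e2), "a") == merge_pre(move(e1, "a"), move(e2, "a")))
```

The `merge` in this sketch assumes its arguments share a carrier and does not validate it, which keeps the example short.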

5.1 Intersection and complement 

Pointed expressions cannot be generalized in a trivial way to the
operations of intersection and complement. Suppose we extend the
definition of the language in the obvious way, letting
$L_p(e_1 \cap e_2) = L_p(e_1) \cap L_p(e_2)$ and $L_p(\neg e) = \overline{L_p(e)}$. The problem is that
merging is no longer additive, and Theorem 16 does not hold any
more. For instance, consider the two expressions $e_1 = \bullet a \cap a$ and
$e_2 = a \cap \bullet a$. Clearly $L_p(e_1) = L_p(e_2) = \emptyset$, but
$L_p(e_1 \dagger e_2) = L_p(\bullet a \cap \bullet a) = \{a\}$. To better understand the problem, let
$e = (\bullet b\,a \cap \bullet a) + \bullet b$, and let us consider the result of $move(e^*, b)$.
Since $move(e, b) = \langle (b \bullet a \cap a) + b, true\rangle$, we should broadcast a new
point inside $(b \bullet a \cap a) + b$, hence
$move(e^*, b) = \langle ((\bullet b \bullet a \cap \bullet a) + \bullet b)^*, true\rangle$,
which is obviously wrong.

The problems in extending the technique to intersection and
complement are not due to some easily avoidable deficiency of the
approach but have a deep theoretical reason: indeed, even if these
operators do not increase the expressive power of regular
expressions, they can have a drastic impact on succinctness, making
them much harder to handle. For instance, it is well known that
expressions with complements can provide descriptions of certain
languages which are non-elementarily more compact than standard
regular expressions [15]. Gelade [12] has recently proved that for any
natural number $n$ there exists a regular expression with
intersection of size $O(n)$ such that any DFA accepting its language has
a double-exponential size, i.e. it contains at least $2^{2^n}$ states (see
also [13]). Hence, marking positions with points is not enough, just
because we would not have enough states.

Since the problem is due to a loss of information during merging,
we are currently investigating the possibility of exploiting colored
points. An important goal of this approach would be to provide
simple, completely syntactic explanations for space bounds of different
classes of languages.

6. Conclusions 

We introduced in this paper the notion of pointed regular expres- 
sion, investigated its main properties, and its relation with Brzo- 
zowski's derivatives. Points are used to mark the positions inside 
the regular expression which have been reached after reading some 
prefix of the input string, and where the processing of the remain- 
ing string should start. In particular, each pointed expression has a 
clear semantics. Since each pointed expression for e represents a 
state of the deterministic automaton associated with e, this means 
we may associate a semantics to each state in terms of the specifi- 
cation e and not of the behaviour of the automaton. This allows a 
direct, intuitive and easily verifiable construction of the determin- 
istic automaton for e. 

A major advantage of pointed expressions is didactic.
Relying on an electronic device, it is a real pleasure
to see points moving inside the regular expression in response to an
input symbol. Students immediately grasp the idea, and are able
to manually build the automaton, and to understand the meaning
of its states, after a single lesson. Moreover, if time is really
short, one can skip the notion of nondeterministic automata
altogether.

Regular expressions have received renewed interest in recent years,
mostly due to their use in XML-languages. Pointed expressions
seem to open a huge range of novel perspectives and original
approaches in the field, starting from the challenging generalization
of the approach to different operators such as counting,
intersection, and interleaving (e.g. exploiting colors for points, see
Section 5.1). A large amount of research has been recently devoted to
the so-called succinctness problem, namely the investigation of the
descriptional complexity of regular languages (see e.g. [12, 13, 14]).
Since, as observed in Example 10, pointed expressions can provide
a more compact description for regular languages than traditional
regular expressions, it looks interesting to better investigate this
issue (that seems to be related to the so-called star-height [16] of the
language).



It could also be worth investigating variants of the notion of
pointed expression, allowing different positioning of points inside
the expressions. Merging must be better investigated, and the whole
equational theory of pointed expressions, both with different and
(especially) fixed carriers, must be entirely developed.

As explained in the introduction, the notion of pointed expression
was suggested by an attempt at formalizing the theory of regular
languages by means of an interactive prover. This testifies to the
relevance of the choice of good data structures not just for the
design of algorithms but also for the formal investigation of a given
field, and is a nice example of the kind of interesting feedback one
may expect from the interplay with automated devices for proof
development.